Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ajinkya Kale
Unfortunately I cannot at this moment (not a decision I can make) :(

On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:

> I am not aware of a workaround.
>
> Can you upgrade to 0.98.4+ release ?
>
> Cheers
>
> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
> wrote:
>
>> Hi Ted,
>>
>> Thanks for responding.
>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>> HADOOP_CLASSPATH didnt work for me.
>>
>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>
>>> 0.98.0 didn't have fix from HBASE-8
>>>
>>> Please upgrade your hbase version and try again.
>>>
>>> If still there is problem, please pastebin the stack trace.
>>>
>>> Thanks
>>>
>>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
>>> wrote:
>>>

 I have posted this on hbase user list but i thought makes more sense on
 spark user list.
 I am able to read the table in yarn-client mode from spark-shell but I
 have exhausted all online forums for options to get it working in the
 yarn-cluster mode through spark-submit.

 I am using this code-example
 http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
  to
 read a hbase table using Spark with the only change of adding the
 hbase.zookeeper.quorum through code as it is not picking it from the
 hbase-site.xml.

 Spark 1.5.3

 HBase 0.98.0


 Facing this error -

  16/01/20 12:56:59 WARN 
 client.ConnectionManager$HConnectionImplementation: Encountered problems 
 when prefetch hbase:meta table:
 org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
 attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
 org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
 java.lang.IllegalAccessError: class 
 com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
 com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
 org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
 java.lang.IllegalAccessError: 
 com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 GMT-07:00 
 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
 java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString

 at 
 org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
 at 
 org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
 at 
 org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
 at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
 at 
 org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
 at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
 at org.apache.spark.rdd.RDD.take(RDD.scala:1276)

 I tried adding the hbase protocol jar on spar-defaults.conf and in the
 driver-classpath as suggested here
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
  but
 no success.
 Any suggestions ?


>>>
>


--driver-java-options not support multiple JVM configuration ?

2016-01-20 Thread our...@cnsuning.com
Hi all,
 --driver-java-options does not seem to support multiple JVM options.

The submit command is as follows:

Cores=16 
sparkdriverextraJavaOptions="-XX:newsize=2096m -XX:MaxPermSize=512m 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseP 
arNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 
-XX:GCTimeLimit=5 -XX:GCHeapFreeLimit=95" 
main1=com.suning.spark.streaming.ppsc.RecommendBasedShoppingCart 
spark-submit --deploy-mode cluster \ 
--total-executor-cores $Cores \ 
--executor-memory 8g \ 
--driver-memory 16g \ 
--conf spark.driver.cores=4 \ 
--driver-java-options $sparkdriverextraJavaOptions \ 
--class $main1 \ 
hdfs:///user/bdapp/$appjars 

The error:
  Error: Unrecognized option '-XX:MaxPermSize=512m  ;


When changed to:

sparkdriverextraJavaOptions="-XX:newsize=2096m,-XX:MaxPermSize=512m,-XX:+PrintGCDetails,-XX:+PrintGCTimeStamps,-XX:+UseParNewGC,-XX:+UseConcMarkSweepGC,-XX:CMSInitiatingOccupancyFraction=80,-XX:GCTimeLimit=5,-XX:GCHeapFreeLimit=95"

the driver error is:
Unrecognized VM option 
'newsize=2096m,-XX:MaxPermSize=512m,-XX:+PrintGCDetails,-XX:+PrintGCTimeStamps,-XX:+UseParNewGC,-XX:+UseConcMarkSweepGC,-XX:CMSInitiatingOccupancyFraction=80,-XX:GCTimeLimit=5,-XX:GCHeapFreeLimit=95


HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ajinkya Kale
I have posted this on the HBase user list, but I thought it makes more sense on
the Spark user list.
I am able to read the table in yarn-client mode from spark-shell, but I have
exhausted all online forums for options to get it working in
yarn-cluster mode through spark-submit.

I am using this code example
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
to read an HBase table using Spark, with the only change being that I add
hbase.zookeeper.quorum through code, as it is not being picked up from
hbase-site.xml.

Spark 1.5.3

HBase 0.98.0


Facing this error -

 16/01/20 12:56:59 WARN
client.ConnectionManager$HConnectionImplementation: Encountered
problems when prefetch hbase:meta table:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016,
org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e,
java.lang.IllegalAccessError: class
com.google.protobuf.HBaseZeroCopyByteString cannot access its
superclass com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58
GMT-07:00 2016,
org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e,
java.lang.IllegalAccessError:
com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59
GMT-07:00 2016,
org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e,
java.lang.IllegalAccessError:
com/google/protobuf/HBaseZeroCopyByteString

at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:201)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.take(RDD.scala:1276)

I tried adding the hbase-protocol jar in spark-defaults.conf and in the
driver classpath as suggested here
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
but with no success.
Any suggestions?


Re: Re: --driver-java-options not support multiple JVM configuration ?

2016-01-20 Thread our...@cnsuning.com
Marcelo,
The error still occurs with quotes around "$sparkdriverextraJavaOptions":
Unrecognized VM option 
'newsize=2096m,-XX:MaxPermSize=512m,-XX:+PrintGCDetails,-XX:+PrintGCTimeStamps,-XX:+UseParNewGC,-XX:+UseConcMarkSweepGC,-XX:CMSInitiatingOccupancyFraction=80,-XX:GCTimeLimit=5,-XX:GCHeapFreeLimit=95'


 
From: Marcelo Vanzin
Date: 2016-01-21 12:09
To: our...@cnsuning.com
CC: user
Subject: Re: --driver-java-options not support multiple JVM configuration ?
On Wed, Jan 20, 2016 at 7:38 PM, our...@cnsuning.com
 wrote:
> --driver-java-options $sparkdriverextraJavaOptions \
 
You need quotes around "$sparkdriverextraJavaOptions".
 
-- 
Marcelo
 
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
 


Re: Spark + Sentry + Kerberos don't add up?

2016-01-20 Thread Ruslan Dautkhanov
I took the liberty of creating an issue: https://github.com/cloudera/livy/issues/36
Feel free to close it if it doesn't belong to the Livy project.
I really don't know whether this is a Spark or a Livy/Sentry problem.

Any ideas for possible workarounds?

Thank you.



-- 
Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov 
wrote:

> Hi Romain,
>
> Thank you for your response.
>
> Adding Kerberos support might be as simple as
> https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal
> and --keytab parameters to be passed to spark-submit.
>
> As a workaround I just did kinit (using Hue's keytab) and then launched
> Livy Server. It will probably work as long as the Kerberos ticket doesn't
> expire. That's why it would be great to have support for the --principal and
> --keytab parameters for spark-submit, as explained in
> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html
>
>
> The only problem I have currently is the above error stack in my previous
> email:
>
> The Spark session could not be created in the cluster:
>> at org.apache.hadoop.security.*UserGroupInformation.doAs*(
>> UserGroupInformation.java:1671)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
>> SparkSubmit.scala:160)
>
>
>
> >> AFAIK Hive impersonation should be turned off when using Sentry
>
> Yep, exactly. That's what I did. It is disabled now. But it looks like, on the
> other hand, Spark or the Spark Notebook wants to have that enabled?
> It tries to do org.apache.hadoop.security.UserGroupInformation.doAs()
> hence the error.
>
> So Sentry isn't compatible with Spark in kerberized clusters? Is any
> workaround for this problem?
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux 
> wrote:
>
>> Livy does not support any Kerberos yet
>> https://issues.cloudera.org/browse/LIVY-3
>>
>> Are you focusing instead about HS2 + Kerberos with Sentry?
>>
>> AFAIK Hive impersonation should be turned off when using Sentry:
>> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html
>>
>> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > > wrote:
>>
>>> Getting following error stack
>>>
>>> The Spark session could not be created in the cluster:
 at org.apache.hadoop.security.*UserGroupInformation.doAs*
 (UserGroupInformation.java:1671)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
 at
 org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
 at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
 .open(HiveMetaStoreClient.java:466)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
 at
 org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
 ... 35 more
>>>
>>>
>>> My understanding that hive.server2.enable.impersonation and
>>> hive.server2.enable.doAs should be enabled to make
>>> UserGroupInformation.doAs() work?
>>>
>>> When I try to enable these parameters, Cloudera Manager shows error
>>>
>>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
 (hostname)'.
 Hive Impersonation should be disabled to enable Hive authorization
 using Sentry
>>>
>>>
>>> So Spark-Hive conflicts with Sentry!?
>>>
>>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master).
>>> CDH 5.5.
>>>
>>> This is a kerberized cluster with Sentry.
>>>
>>> I was using hue's keytab as hue user is normally (by default in CDH) is
>>> allowed to impersonate to other users.
>>> So very convenient for Spark Notebooks.
>>>
>>> Any information to help solve this will be highly appreciated.
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Hue-Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to hue-user+unsubscr...@cloudera.org.
>>>
>>
>>
>


Re: spark task scheduling delay

2016-01-20 Thread Renu Yadav
Any suggestions?

On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav  wrote:

> Hi ,
>
> I am facing spark   task scheduling delay issue in spark 1.4.
>
> suppose I have 1600 tasks running then 1550 tasks runs fine but for the
> remaining 50 i am facing task delay even if the input size of these task is
> same as the above 1550 tasks
>
> Please suggest some solution.
>
> Thanks & Regards
> Renu Yadav
>


Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ted Yu
0.98.0 didn't have fix from HBASE-8

Please upgrade your hbase version and try again.

If there is still a problem, please pastebin the stack trace.

Thanks

On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale  wrote:

>
> I have posted this on hbase user list but i thought makes more sense on
> spark user list.
> I am able to read the table in yarn-client mode from spark-shell but I
> have exhausted all online forums for options to get it working in the
> yarn-cluster mode through spark-submit.
>
> I am using this code-example
> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>  to
> read a hbase table using Spark with the only change of adding the
> hbase.zookeeper.quorum through code as it is not picking it from the
> hbase-site.xml.
>
> Spark 1.5.3
>
> HBase 0.98.0
>
>
> Facing this error -
>
>  16/01/20 12:56:59 WARN client.ConnectionManager$HConnectionImplementation: 
> Encountered problems when prefetch hbase:meta table:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
> java.lang.IllegalAccessError: class 
> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteStringWed 
> Jan 20 12:56:59 GMT-07:00 2016, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
> at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
> at 
> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
> at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1276)
>
> I tried adding the hbase protocol jar on spar-defaults.conf and in the
> driver-classpath as suggested here
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
>  but
> no success.
> Any suggestions ?
>
>


Re: Window Functions importing issue in Spark 1.4.0

2016-01-20 Thread satish chandra j
Hi Ted,
Thanks for sharing the link on rowNumber example on usage

Could you please let me know whether I can use the rowNumber window function in
my current Spark 1.4.0 version?
If yes, then why am I getting an error on "import org.apache.spark.sql.
expressions.Window" and "import org.apache.spark.sql.functions.rowNumber"?

Thanks for providing your valuable inputs

Regards,
Satish Chandra J

On Thu, Jan 7, 2016 at 4:41 PM, Ted Yu  wrote:

> Please take a look at the following for sample on how rowNumber is used:
> https://github.com/apache/spark/pull/9050
>
> BTW 1.4.0 was an old release.
>
> Please consider upgrading.
>
> On Thu, Jan 7, 2016 at 3:04 AM, satish chandra j  > wrote:
>
>> HI All,
>> Currently using Spark 1.4.0 version, I have a requirement to add a column
>> having Sequential Numbering to an existing DataFrame
>> I understand Window Function "rowNumber" serves my purpose
>> hence I have below import statements to include the same
>>
>> import org.apache.spark.sql.expressions.Window
>> import org.apache.spark.sql.functions.rowNumber
>>
>> But I am getting an error at the import statement itself such as
>> "object expressions is not a member of package org.apache.spark.sql"
>>
>> "value rowNumber is not a member of object org.apache.spark.sql.functions"
>>
>> Could anybody throw some light if any to fix the issue
>>
>> Regards,
>> Satish Chandra
>>
>
>


Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ajinkya Kale
Hi Ted,

Thanks for responding.
Is there a workaround for 0.98.0? Adding the hbase-protocol jar to
HADOOP_CLASSPATH didn't work for me.

On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:

> 0.98.0 didn't have fix from HBASE-8
>
> Please upgrade your hbase version and try again.
>
> If still there is problem, please pastebin the stack trace.
>
> Thanks
>
> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
> wrote:
>
>>
>> I have posted this on hbase user list but i thought makes more sense on
>> spark user list.
>> I am able to read the table in yarn-client mode from spark-shell but I
>> have exhausted all online forums for options to get it working in the
>> yarn-cluster mode through spark-submit.
>>
>> I am using this code-example
>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>  to
>> read a hbase table using Spark with the only change of adding the
>> hbase.zookeeper.quorum through code as it is not picking it from the
>> hbase-site.xml.
>>
>> Spark 1.5.3
>>
>> HBase 0.98.0
>>
>>
>> Facing this error -
>>
>>  16/01/20 12:56:59 WARN client.ConnectionManager$HConnectionImplementation: 
>> Encountered problems when prefetch hbase:meta table:
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: class 
>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteStringWed 
>> Jan 20 12:56:59 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>>
>> at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>> at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>> at 
>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> at scala.Option.getOrElse(Option.scala:120)
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>> at org.apache.spark.rdd.RDD.take(RDD.scala:1276)
>>
>> I tried adding the hbase protocol jar on spar-defaults.conf and in the
>> driver-classpath as suggested here
>> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
>>  but
>> no success.
>> Any suggestions ?
>>
>>
>


Re: --driver-java-options not support multiple JVM configuration ?

2016-01-20 Thread Marcelo Vanzin
On Wed, Jan 20, 2016 at 7:38 PM, our...@cnsuning.com
 wrote:
> --driver-java-options $sparkdriverextraJavaOptions \

You need quotes around "$sparkdriverextraJavaOptions".

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ted Yu
I am not aware of a workaround.

Can you upgrade to 0.98.4+ release ?

Cheers

On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale  wrote:

> Hi Ted,
>
> Thanks for responding.
> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
> HADOOP_CLASSPATH didnt work for me.
>
> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>
>> 0.98.0 didn't have fix from HBASE-8
>>
>> Please upgrade your hbase version and try again.
>>
>> If still there is problem, please pastebin the stack trace.
>>
>> Thanks
>>
>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
>> wrote:
>>
>>>
>>> I have posted this on hbase user list but i thought makes more sense on
>>> spark user list.
>>> I am able to read the table in yarn-client mode from spark-shell but I
>>> have exhausted all online forums for options to get it working in the
>>> yarn-cluster mode through spark-submit.
>>>
>>> I am using this code-example
>>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>>  to
>>> read a hbase table using Spark with the only change of adding the
>>> hbase.zookeeper.quorum through code as it is not picking it from the
>>> hbase-site.xml.
>>>
>>> Spark 1.5.3
>>>
>>> HBase 0.98.0
>>>
>>>
>>> Facing this error -
>>>
>>>  16/01/20 12:56:59 WARN client.ConnectionManager$HConnectionImplementation: 
>>> Encountered problems when prefetch hbase:meta table:
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>> java.lang.IllegalAccessError: class 
>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>>> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>> java.lang.IllegalAccessError: 
>>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 GMT-07:00 
>>> 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>>>
>>> at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>> at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>>> at 
>>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>>> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>>> at 
>>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>>> at 
>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
>>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>> at scala.Option.getOrElse(Option.scala:120)
>>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
>>> at 
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>>> at 
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1276)
>>>
>>> I tried adding the hbase protocol jar on spar-defaults.conf and in the
>>> driver-classpath as suggested here
>>> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
>>>  but
>>> no success.
>>> Any suggestions ?
>>>
>>>
>>


best practice : how to manage your Spark cluster ?

2016-01-20 Thread charles li
I've posted a thread before: pre-install third-party Python packages on a Spark
cluster.

Currently I use *Fabric* to manage my cluster, but it's not enough for me,
and I believe there is a much better way to *manage and monitor* the
cluster.

I believe there are open-source management tools that provide a web UI
allowing me to do the following [what I need exactly]:


   - monitor the cluster machine's state in real-time, say memory, network,
   disk
   - list all the services, packages on each machine
   - install / uninstall / upgrade / downgrade package through a web UI
   - start / stop / restart services on that machine



great thanks

-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: retrieve cell value from a rowMatrix.

2016-01-20 Thread zhangjp
Use the apply(i,j) function.
 Also, do you know how to save a matrix to a file using Java?
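
 If RowMatrix itself does not expose a direct cell accessor in your version, one
 workaround sketch (fine for occasional lookups, not for tight loops) is to index
 the underlying row RDD, assuming org.apache.spark.mllib.linalg.distributed.RowMatrix:

 import org.apache.spark.mllib.linalg.distributed.RowMatrix

 // fetch cell (i, j) by pairing each row vector with its row index
 def cellAt(m: RowMatrix, i: Long, j: Int): Double =
   m.rows.zipWithIndex()
     .filter { case (_, idx) => idx == i }
     .map { case (vec, _) => vec(j) }
     .first()

 For saving, m.rows is an ordinary RDD[Vector], so something like
 m.rows.saveAsTextFile(path) works from Java or Scala; the right output format
 depends on how the matrix will be read back.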

 

 ------------------ Original message ------------------
 From: "Srivathsan Srinivas"
 Date: 2016-01-21 (Thu) 9:04
 To: "user"
 Subject: retrieve cell value from a rowMatrix.

 

 Hi, Is there a way to retrieve the cell value of a rowMatrix? Like m(i,j)? 
The docs say that the indices are long. Maybe I am doing something wrong...but, 
there doesn't seem to be any such direct method.
 

 Any suggestions?
 

-- 
 Thanks,
Srini.

Re: visualize data from spark streaming

2016-01-20 Thread Silvio Fiorito
You’ve got a few options:

  *   Use a notebook tool such as Zeppelin, Jupyter, or Spark Notebook to write 
up some visualizations which update in time with your streaming batches
  *   Use Spark Streaming to push your batch results to another 3rd-party 
system with a BI tool that supports realtime updates such as ZoomData or Power 
BI
  *   Write your own using many of the tools that support a reactive model to 
push updates to a d3.js web front end

I’ve had the opportunity to work on all 3 of the options above and it just 
depends on your timeframe, requirements, and budget. For instance, the 
notebooks are good for engineers or data scientists but not something you’d 
necessarily put in front of a non-technical end-user. The BI tools on the other 
hand may be less customizable but more approachable by business users.
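
As a rough illustration of the third option, the streaming side usually boils
down to publishing a small summary of each micro-batch from foreachRDD; the
endpoint and the postJson helper below are hypothetical placeholders, not a real
API:

import org.apache.spark.streaming.dstream.DStream

// push the top counts of every batch to whatever serves the web front end
def publish(counts: DStream[(String, Long)]): Unit =
  counts.foreachRDD { rdd =>
    val top = rdd.top(20)(Ordering.by(_._2))            // keep the payload small
    val json = top.map { case (k, v) => s"""{"key":"$k","count":$v}""" }
                  .mkString("[", ",", "]")
    // postJson("http://dashboard.example.com/update", json)   // hypothetical sink
    println(json)
  }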

Thanks,
Silvio

On 1/20/16, 2:54 PM, "patcharee" 
> wrote:

Hi,

How to visualize realtime data (in graph/chart) from spark streaming?
Any tools?

Best,
Patcharee

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org




Re: Create a n x n graph given only the vertices no

2016-01-20 Thread praveen S
Hi Robin,

I am using Spark 1.3 and I am not able to find the api
Graph.fromEdgeTuples(edge RDD, 1)

Regards,
Praveen
Well you can use a similar tech to generate an RDD[(Long, Long)] (that’s
what the edges variable is) and then create the Graph using
Graph.fromEdgeTuples.
---
Robin East
*Spark GraphX in Action* Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 11 Jan 2016, at 12:30, praveen S  wrote:

Yes, I was looking for something of that sort. Thank you.

Actually I was looking for a way to connect nodes based on the properties of
the nodes. I have a set of nodes and I know the condition under which I can
create an edge.
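
A small sketch of that idea, building on the cartesian approach quoted below
(canConnect is a placeholder for your real condition, and the string property is
just an example):

import org.apache.spark.graphx._

// placeholder predicate -- replace with the actual condition on two node properties
def canConnect(a: String, b: String): Boolean = a != b

val verts = sc.parallelize(Seq((1L, "red"), (2L, "blue"), (3L, "red")))
val edges = verts.cartesian(verts)
  .filter { case ((i, pa), (j, pb)) => i != j && canConnect(pa, pb) }
  .map { case ((i, _), (j, _)) => (i, j) }
val g = Graph.fromEdgeTuples(edges, 1)

If the original properties are needed back on the graph, they can be re-attached
afterwards with outerJoinVertices.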
On 11 Jan 2016 14:06, "Robin East"  wrote:

> Do you mean to create a perfect graph? You can do it like this:
>
> scala> import org.apache.spark.graphx._
> import org.apache.spark.graphx._
>
> scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L))
> vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at
> parallelize at :36
>
> scala> val edges = vert.cartesian(vert)
> edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at
> cartesian at :38
>
> scala> val g = Graph.fromEdgeTuples(edges, 1)
>
>
> That will give you a graph where all vertices connect to every other
> vertices. it will give you self joins as well i.e. vert 1 has an edge to
> vert 1, you can deal with this using a filter on the edges RDD if you don’t
> want self-joins
>
> --
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 11 Jan 2016, at 03:19, praveen S  wrote:
>
> Is it possible in graphx to create/generate graph of n x n given only the
> vertices.
> On 8 Jan 2016 23:57, "praveen S"  wrote:
>
>> Is it possible in graphx to create/generate a graph n x n given n
>> vertices?
>>
>
>


Re: Create a n x n graph given only the vertices no

2016-01-20 Thread praveen S
Sorry.. Found the api..
On 21 Jan 2016 10:17, "praveen S"  wrote:

> Hi Robin,
>
> I am using Spark 1.3 and I am not able to find the api
> Graph.fromEdgeTuples(edge RDD, 1)
>
> Regards,
> Praveen
> Well you can use a similar tech to generate an RDD[(Long, Long)] (that’s
> what the edges variable is) and then create the Graph using
> Graph.fromEdgeTuples.
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 11 Jan 2016, at 12:30, praveen S  wrote:
>
> Yes I was looking something of that sort..  Thank you.
>
> Actually I was looking for a way to connect nodes based on the property of
> the nodes.. I have a set of nodes and I know the condition on which I can
> create an edge..
> On 11 Jan 2016 14:06, "Robin East"  wrote:
>
>> Do you mean to create a perfect graph? You can do it like this:
>>
>> scala> import org.apache.spark.graphx._
>> import org.apache.spark.graphx._
>>
>> scala> val vert = sc.parallelize(Seq(1L,2L,3L,4L,5L))
>> vert: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[23] at
>> parallelize at :36
>>
>> scala> val edges = vert.cartesian(vert)
>> edges: org.apache.spark.rdd.RDD[(Long, Long)] = CartesianRDD[24] at
>> cartesian at :38
>>
>> scala> val g = Graph.fromEdgeTuples(edges, 1)
>>
>>
>> That will give you a graph where all vertices connect to every other
>> vertices. it will give you self joins as well i.e. vert 1 has an edge to
>> vert 1, you can deal with this using a filter on the edges RDD if you don’t
>> want self-joins
>>
>> --
>> Robin East
>> *Spark GraphX in Action* Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>>
>>
>>
>> On 11 Jan 2016, at 03:19, praveen S  wrote:
>>
>> Is it possible in graphx to create/generate graph of n x n given only the
>> vertices.
>> On 8 Jan 2016 23:57, "praveen S"  wrote:
>>
>>> Is it possible in graphx to create/generate a graph n x n given n
>>> vertices?
>>>
>>
>>
>


Re: Parquet write optimization by row group size config

2016-01-20 Thread Akhil Das
It would be good if you could share the code; someone here, or I, can guide you
better if you post the code snippet.

Thanks
Best Regards

On Wed, Jan 20, 2016 at 10:54 PM, Pavel Plotnikov <
pavel.plotni...@team.wrike.com> wrote:

> Thanks, Akhil! It helps, but this jobs still not fast enough, maybe i
> missed something
>
> Regards,
> Pavel
>
> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das 
> wrote:
>
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
>> pavel.plotni...@team.wrike.com> wrote:
>>
>>> Hello,
>>> I'm using spark on some machines in standalone mode, data storage is
>>> mounted on this machines via nfs. A have input data stream and when i'm
>>> trying to store all data for hour in parquet, a job executes mostly on one
>>> core and this hourly data are stored in 40- 50 minutes. It is very slow!
>>> And it is not IO problem. After research how parquet file works, i'm found
>>> that it can be parallelized on row group abstraction level.
>>> I think row group for my files is to large, and how can i change it?
>>> When i create to big DataFrame i devides in parts very well and writes
>>> quikly!
>>>
>>> Thanks,
>>> Pavel
>>>
>>
>>
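
For what it's worth, a sketch of the two knobs usually tried for this (df being
the hourly DataFrame and sc the SparkContext; the 64 MB row-group size and the
partition count are arbitrary example values, and whether parquet.block.size
helps depends on the setup):

// smaller Parquet row groups -- read from the Hadoop configuration at write time
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

// spread the hourly write over more tasks so it is not bound to a single core
df.repartition(32)
  .write
  .parquet("/path/to/hourly/output")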


Re: a lot of warnings when build spark 1.6.0

2016-01-20 Thread Eli Super
Thanks Sean.

In the command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.10
-Phive -Phive-thriftserver -DskipTests clean package

is the string -Phadoop-2.4 -Dhadoop.version=2.4.0 a kind of duplication?

Can I use just one flag to define the Hadoop version?

Also, I don't have Hadoop; I just build Spark locally, only with the CSV package
and the Thrift server. Which Hadoop version should I use to avoid the warnings?

Thanks a lot !

On Thu, Jan 21, 2016 at 9:08 AM, Eli Super  wrote:

> Hi
>
> I get WARNINGS when try to build spark 1.6.0
>
> overall I get SUCCESS message on all projects
>
> command I used :
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.10 -Phive
> -Phive-thriftserver  -DskipTests clean package
>
> from pom.xml
>
>  2.10.5
>  2.10
>
>
> example of warnings :
>
>
> [INFO]
> 
> [INFO] Building Spark Project Core 1.6.0
> [INFO]
> 
> [INFO]
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
> spark-core_2.10 ---
> [INFO] Deleting C:\spark-1.6.0\core\target
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
> spark-core_2.10 ---
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
> spark-core_2.10 ---
> [INFO] Add Source directory: C:\spark-1.6.0\core\src\main\scala
> [INFO] Add Test Source directory: C:\spark-1.6.0\core\src\test\scala
> [INFO]
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
> spark-core_2.10 ---
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-core_2.10 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 21 resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-core_2.10 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 486 Scala sources and 76 Java sources to
> C:\spark-1.6.0\core\target\scala-2.10\classes...
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\storage\TachyonBlockManager.scala:104:
> value TRY_CACHE in object WriteType is deprecated: see corresponding
> Javadoc fo
> r more information.
> [WARNING] val os = file.getOutStream(WriteType.TRY_CACHE)
> [WARNING]  ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\storage\TachyonBlockManager.scala:118:
> value TRY_CACHE in object WriteType is deprecated: see corresponding
> Javadoc fo
> r more information.
> [WARNING] val os = file.getOutStream(WriteType.TRY_CACHE)
> [WARNING]  ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:186:
> constructor SparkContext in class SparkContext is deprecated: Passing in
> preferred locations h
> as no effect at all, see SPARK-10921
> [WARNING] this(master, appName, null, Nil, Map())
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:196:
> constructor SparkContext in class SparkContext is deprecated: Passing in
> preferred locations h
> as no effect at all, see SPARK-10921
> [WARNING] this(master, appName, sparkHome, Nil, Map())
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:208:
> constructor SparkContext in class SparkContext is deprecated: Passing in
> preferred locations h
> as no effect at all, see SPARK-10921
> [WARNING] this(master, appName, sparkHome, jars, Map())
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:871:
> constructor Job in class Job is deprecated: see corresponding Javadoc for
> more information.
> [WARNING] val job = new NewHadoopJob(hadoopConfiguration)
> [WARNING]   ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:920:
> constructor Job in class Job is deprecated: see corresponding Javadoc for
> more information.
> [WARNING] val job = new NewHadoopJob(hadoopConfiguration)
> [WARNING]   ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:1099:
> constructor Job in class Job is deprecated: see corresponding Javadoc for
> more information.
> [WARNING] val job = new NewHadoopJob(conf)
> [WARNING]   ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:1366:
> method isDir in class FileStatus is deprecated: see corresponding Javadoc
> for more informatio
> n.
> [WARNING]   val isDir = fs.getFileStatus(hadoopPath).isDir
> [WARNING]^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkEnv.scala:101:
> value actorSystem in 

Re: spark task scheduling delay

2016-01-20 Thread Stephen Boesch
Which Resource Manager  are you using?

2016-01-20 21:38 GMT-08:00 Renu Yadav :

> Any suggestions?
>
> On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav  wrote:
>
>> Hi ,
>>
>> I am facing spark   task scheduling delay issue in spark 1.4.
>>
>> suppose I have 1600 tasks running then 1550 tasks runs fine but for the
>> remaining 50 i am facing task delay even if the input size of these task is
>> same as the above 1550 tasks
>>
>> Please suggest some solution.
>>
>> Thanks & Regards
>> Renu Yadav
>>
>
>


a lot of warnings when build spark 1.6.0

2016-01-20 Thread Eli Super
Hi,

I get WARNINGs when I try to build Spark 1.6.0.

Overall I get a SUCCESS message on all projects.

The command I used:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.10 -Phive
-Phive-thriftserver  -DskipTests clean package

from pom.xml

 2.10.5
 2.10


Example of warnings:


[INFO]

[INFO] Building Spark Project Core 1.6.0
[INFO]

[INFO]
[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @ spark-core_2.10
---
[INFO] Deleting C:\spark-1.6.0\core\target
[INFO]
[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
spark-core_2.10 ---
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
spark-core_2.10 ---
[INFO] Add Source directory: C:\spark-1.6.0\core\src\main\scala
[INFO] Add Test Source directory: C:\spark-1.6.0\core\src\test\scala
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @
spark-core_2.10 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
spark-core_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 21 resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal
incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 486 Scala sources and 76 Java sources to
C:\spark-1.6.0\core\target\scala-2.10\classes...
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\storage\TachyonBlockManager.scala:104:
value TRY_CACHE in object WriteType is deprecated: see corresponding
Javadoc fo
r more information.
[WARNING] val os = file.getOutStream(WriteType.TRY_CACHE)
[WARNING]  ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\storage\TachyonBlockManager.scala:118:
value TRY_CACHE in object WriteType is deprecated: see corresponding
Javadoc fo
r more information.
[WARNING] val os = file.getOutStream(WriteType.TRY_CACHE)
[WARNING]  ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:186:
constructor SparkContext in class SparkContext is deprecated: Passing in
preferred locations h
as no effect at all, see SPARK-10921
[WARNING] this(master, appName, null, Nil, Map())
[WARNING] ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:196:
constructor SparkContext in class SparkContext is deprecated: Passing in
preferred locations h
as no effect at all, see SPARK-10921
[WARNING] this(master, appName, sparkHome, Nil, Map())
[WARNING] ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:208:
constructor SparkContext in class SparkContext is deprecated: Passing in
preferred locations h
as no effect at all, see SPARK-10921
[WARNING] this(master, appName, sparkHome, jars, Map())
[WARNING] ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:871:
constructor Job in class Job is deprecated: see corresponding Javadoc for
more information.
[WARNING] val job = new NewHadoopJob(hadoopConfiguration)
[WARNING]   ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:920:
constructor Job in class Job is deprecated: see corresponding Javadoc for
more information.
[WARNING] val job = new NewHadoopJob(hadoopConfiguration)
[WARNING]   ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:1099:
constructor Job in class Job is deprecated: see corresponding Javadoc for
more information.
[WARNING] val job = new NewHadoopJob(conf)
[WARNING]   ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:1366:
method isDir in class FileStatus is deprecated: see corresponding Javadoc
for more informatio
n.
[WARNING]   val isDir = fs.getFileStatus(hadoopPath).isDir
[WARNING]^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkEnv.scala:101:
value actorSystem in class SparkEnv is deprecated: Actor system is no
longer supported as of 1.4.0

[WARNING] actorSystem.shutdown()
[WARNING] ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkHadoopWriter.scala:153:
constructor TaskID in class TaskID is deprecated: see corresponding Javadoc
for more info
rmation.
[WARNING] new TaskAttemptID(new TaskID(jID.value, true, splitID),
attemptID))
[WARNING]   ^
[WARNING]
C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkHadoopWriter.scala:174:
method makeQualified in class Path is deprecated: see corresponding Javadoc
for more info
rmation.
[WARNING] outputPath.makeQualified(fs)
[WARNING]^
[WARNING]

Re: How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-20 Thread Neha Mehta
Hi Vishal,

Thanks for the solution. I was able to get it working for my scenario.
Regarding the "Task not serializable" error, I still get it when I declare the
function outside the main method. However, if I declare it inside main as
"val func = {}", it works fine for me.

In case you have any insight to share on this, please do share it.
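
For reference, a pattern that usually sidesteps "Task not serializable" is to put
the function on a standalone object, so the closure only makes a static call and
never drags in the enclosing class instance. A minimal sketch with made-up names
(the aggregation body is just a stand-in):

import org.apache.spark.sql.Row

object RowAggregations {
  // lives on an object: tasks call it statically instead of serializing
  // whatever class or App object the driver code happens to be written in
  def summarize(rows: Iterable[Row]): Map[Int, String] =
    rows.headOption.map(r => Map(0 -> r.toString)).getOrElse(Map.empty)
}

// usage sketch:
// df.rdd.map(r => (r.getInt(0), r)).groupByKey().mapValues(RowAggregations.summarize)

Declaring the function as a val inside main likely works for the same reason: the
closure then captures only that (serializable) function value rather than the
surrounding class.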

Thanks for the help.

Regards,
Neha

On Wed, Jan 20, 2016 at 11:39 AM, Vishal Maru  wrote:

> It seems Spark is not able to serialize your function code to worker nodes.
>
> I have tried to put a solution in simple set of commands. Maybe you can
> combine last four line into function.
>
>
> val arr = Array((1,"A","<20","0"), (1,"A",">20 & <40","1"), (1,"B",">20 &
> <40","0"), (1,"C",">20 & <40","0"), (1,"C",">20 & <40","0"),
> (2,"A","<20","0"), (3,"B",">20 & <40","1"), (3,"B",">40","2"))
>
> val rdd = sc.parallelize(arr)
>
> val prdd = rdd.map(a => (a._1,a))
> val totals = prdd.groupByKey.map(a => (a._1, a._2.size))
>
> var n1 = rdd.map(a => ((a._1, a._2), 1) )
> var n2 = n1.reduceByKey(_+_).map(a => (a._1._1, (a._1._2, a._2)))
> var n3 = n2.join(totals).map(a => (a._1, (a._2._1._1, a._2._1._2.toDouble
> / a._2._2)))
> var n4 = n3.map(a => (a._1, a._2._1 + ":" +
> a._2._2.toString)).reduceByKey((a, b) => a + "|" + b)
>
> n4.collect.foreach(println)
>
>
>
>
> On Mon, Jan 18, 2016 at 6:47 AM, Neha Mehta 
> wrote:
>
>> Hi,
>>
>> I have a scenario wherein my dataset has around 30 columns. It is
>> basically user activity information. I need to group the information by
>> each user and then for each column/activity parameter I need to find the
>> percentage affinity for each value in that column for that user. Below is
>> the sample input and output.
>>
>> UserId C1 C2 C3
>> 1 A <20 0
>> 1 A >20 & <40 1
>> 1 B >20 & <40 0
>> 1 C >20 & <40 0
>> 1 C >20 & <40 0
>> 2 A <20 0
>> 3 B >20 & <40 1
>> 3 B >40 2
>>
>>
>>
>>
>>
>>
>>
>>
>> Output
>>
>>
>> 1 A:0.4|B:0.2|C:0.4 <20:02|>20 & <40:0.8 0:0.8|1:0.2
>> 2 A:1 <20:1 0:01
>> 3 B:1 >20 & <40:0.5|>40:0.5 1:0.5|2:0.5
>>
>> Presently this is how I am calculating these values:
>> Group by UserId and C1 and compute values for C1 for all the users, then
>> do a group by by Userid and C2 and find the fractions for C2 for each user
>> and so on. This approach is quite slow.  Also the number of records for
>> each user will be at max 30. So I would like to take a second approach
>> wherein I do a groupByKey and pass the entire list of records for each key
>> to a function which computes all the percentages for each column for each
>> user at once. Below are the steps I am trying to follow:
>>
>> 1. Dataframe1 => group by UserId , find the counts of records for each
>> user. Join the results back to the input so that counts are available with
>> each record
>> 2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))
>>
>> def myUserAggregator(rows: Iterable[Row]):
>> scala.collection.mutable.Map[Int,String] = {
>> val returnValue = scala.collection.mutable.Map[Int,String]()
>> if (rows != null) {
>>   val activityMap = scala.collection.mutable.Map[Int,
>> scala.collection.mutable.Map[String,
>> Int]]().withDefaultValue(scala.collection.mutable.Map[String,
>> Int]().withDefaultValue(0))
>>
>>   val rowIt = rows.iterator
>>   var sentCount = 1
>>   for (row <- rowIt) {
>> sentCount = row(29).toString().toInt
>> for (i <- 0 until row.length) {
>>   var m = activityMap(i)
>>   if (activityMap(i) == null) {
>> m = collection.mutable.Map[String,
>> Int]().withDefaultValue(0)
>>   }
>>   m(row(i).toString()) += 1
>>   activityMap.update(i, m)
>> }
>>   }
>>   var activityPPRow: Row = Row()
>>   for((k,v) <- activityMap) {
>>   var rowVal:String = ""
>>   for((a,b) <- v) {
>> rowVal += rowVal + a + ":" + b/sentCount + "|"
>>   }
>>   returnValue.update(k, rowVal)
>> //  activityPPRow.apply(k) = rowVal
>>   }
>>
>> }
>> return returnValue
>>   }
>>
>> When I run step 2 I get the following error. I am new to Scala and Spark
>> and am unable to figure out how to pass the Iterable[Row] to a function and
>> get back the results.
>>
>> org.apache.spark.SparkException: Task not serializable
>> at
>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>> at
>> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>> at
>> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>> at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
>> at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
>> at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at
>> 

Re: a lot of warnings when build spark 1.6.0

2016-01-20 Thread Sean Owen
These are just warnings. Most are unavoidable given the version of Hadoop
supported vs what you build with.

On Thu, Jan 21, 2016, 08:08 Eli Super  wrote:

> Hi
>
> I get WARNINGS when try to build spark 1.6.0
>
> overall I get SUCCESS message on all projects
>
> command I used :
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.10 -Phive
> -Phive-thriftserver  -DskipTests clean package
>
> from pom.xml
>
>  2.10.5
>  2.10
>
>
> example of warnings :
>
>
> [INFO]
> 
> [INFO] Building Spark Project Core 1.6.0
> [INFO]
> 
> [INFO]
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
> spark-core_2.10 ---
> [INFO] Deleting C:\spark-1.6.0\core\target
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
> spark-core_2.10 ---
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
> spark-core_2.10 ---
> [INFO] Add Source directory: C:\spark-1.6.0\core\src\main\scala
> [INFO] Add Test Source directory: C:\spark-1.6.0\core\src\test\scala
> [INFO]
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
> spark-core_2.10 ---
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-core_2.10 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 21 resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-core_2.10 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 486 Scala sources and 76 Java sources to
> C:\spark-1.6.0\core\target\scala-2.10\classes...
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\storage\TachyonBlockManager.scala:104:
> value TRY_CACHE in object WriteType is deprecated: see corresponding
> Javadoc fo
> r more information.
> [WARNING] val os = file.getOutStream(WriteType.TRY_CACHE)
> [WARNING]  ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\storage\TachyonBlockManager.scala:118:
> value TRY_CACHE in object WriteType is deprecated: see corresponding
> Javadoc fo
> r more information.
> [WARNING] val os = file.getOutStream(WriteType.TRY_CACHE)
> [WARNING]  ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:186:
> constructor SparkContext in class SparkContext is deprecated: Passing in
> preferred locations h
> as no effect at all, see SPARK-10921
> [WARNING] this(master, appName, null, Nil, Map())
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:196:
> constructor SparkContext in class SparkContext is deprecated: Passing in
> preferred locations h
> as no effect at all, see SPARK-10921
> [WARNING] this(master, appName, sparkHome, Nil, Map())
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:208:
> constructor SparkContext in class SparkContext is deprecated: Passing in
> preferred locations h
> as no effect at all, see SPARK-10921
> [WARNING] this(master, appName, sparkHome, jars, Map())
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:871:
> constructor Job in class Job is deprecated: see corresponding Javadoc for
> more information.
> [WARNING] val job = new NewHadoopJob(hadoopConfiguration)
> [WARNING]   ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:920:
> constructor Job in class Job is deprecated: see corresponding Javadoc for
> more information.
> [WARNING] val job = new NewHadoopJob(hadoopConfiguration)
> [WARNING]   ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:1099:
> constructor Job in class Job is deprecated: see corresponding Javadoc for
> more information.
> [WARNING] val job = new NewHadoopJob(conf)
> [WARNING]   ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkContext.scala:1366:
> method isDir in class FileStatus is deprecated: see corresponding Javadoc
> for more informatio
> n.
> [WARNING]   val isDir = fs.getFileStatus(hadoopPath).isDir
> [WARNING]^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkEnv.scala:101:
> value actorSystem in class SparkEnv is deprecated: Actor system is no
> longer supported as of 1.4.0
>
> [WARNING] actorSystem.shutdown()
> [WARNING] ^
> [WARNING]
> C:\spark-1.6.0\core\src\main\scala\org\apache\spark\SparkHadoopWriter.scala:153:
> constructor TaskID in class TaskID is deprecated: see corresponding Javadoc
> 

Re: Parquet write optimization by row group size config

2016-01-20 Thread Jörn Franke
What is your data size, the algorithm and the expected time?
Depending on this the group can recommend you optimizations or tell you that 
the expectations are wrong

> On 20 Jan 2016, at 18:24, Pavel Plotnikov  
> wrote:
> 
> Thanks, Akhil! It helps, but these jobs are still not fast enough; maybe I
> missed something.
> 
> Regards,
> Pavel
> 
>> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das  wrote:
>> Did you try re-partitioning the data before doing the write?
>> 
>> Thanks
>> Best Regards
>> 
>>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov 
>>>  wrote:
>>> Hello, 
>>> I'm using Spark on some machines in standalone mode; data storage is
>>> mounted on these machines via NFS. I have an input data stream, and when
>>> I try to store an hour's worth of data in Parquet, the job executes mostly
>>> on one core and the hourly data takes 40-50 minutes to store. It is very
>>> slow, and it is not an IO problem. After researching how the Parquet format
>>> works, I found that writing can be parallelized at the row group
>>> abstraction level.
>>> I think the row group for my files is too large - how can I change it?
>>> When I create a big DataFrame, it divides into parts very well and writes
>>> quickly!
>>> 
>>> Thanks,
>>> Pavel


Re: trouble implementing complex transformer in java that can be used with Pipeline. Scala to Java porting problem

2016-01-20 Thread Andy Davidson
Very Nice!. Many thanks Kevin. I wish I found this out a couple of weeks
ago.

Andy

From:  Kevin Mellott 
Date:  Wednesday, January 20, 2016 at 4:34 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble implementing complex transformer in java that can be
used with Pipeline. Scala to Java porting problem

> Hi Andy,
> 
> According to the API documentation for DataFrame, you should have access to
> sqlContext as a property off of the DataFrame instance. In your example, you
> could then do something like:
> 
> df.sqlContext.udf.register(...)
> 
> Thanks,
> Kevin
> 
> On Wed, Jan 20, 2016 at 6:15 PM, Andy Davidson 
> wrote:
>> For clarity, callUDF() is not defined on DataFrame. It is defined on
>> org.apache.spark.sql.functions. Strangely, the class name starts with lower
>> case. I have not figured out how to use the functions class.
>> 
>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
>> 
>> Andy
>> 
>> From:  Andrew Davidson 
>> Date:  Wednesday, January 20, 2016 at 4:05 PM
>> To:  "user @spark" 
>> Subject:  trouble implementing  complex transformer in java that can be used
>> with Pipeline. Scala to Java porting problem
>> 
>>> I am using 1.6.0. I am having trouble implementing a custom transformer
>>> derived from org.apache.spark.ml.Transformer in Java that I can use in a
>>> Pipeline.
>>> 
>>> So far the only way I have figured out how to implement any kind of complex
>>> functionality and have it applied to a DataFrame is to implement a UDF. For
>>> example:
>>> 
>>> 
>>> class StemmerUDF implements UDF1<String, List<String>>, Serializable {
>>>     private static final long serialVersionUID = 1L;
>>> 
>>>     @Override
>>>     public List<String> call(String text) throws Exception {
>>>         List<String> ret = stemText(text); // call org.apache.lucene
>>>         return ret;
>>>     }
>>> }
>>> 
>>> 
>>> 
>>> Before I can use the UDF it needs to be registered. This requires the
>>> sqlContext. The problem is sqlContext is not available during
>>> pipeline.load()
>>> 
>>> 
>>> void registerUDF(SQLContext sqlContext) {
>>>     if (udf == null) {
>>>         udf = new StemmerUDF();
>>>         DataType returnType = DataTypes.createArrayType(DataTypes.StringType);
>>>         sqlContext.udf().register(udfName, udf, returnType);
>>>     }
>>> }
>>> 
>>> 
>>> Our transformer needs to implement transform(). For it to be able to use the
>>> registered UDF we need the sqlContext. The problem is the sqlContext is not
>>> part of the signature of transform. My current hack is to pass the
>>> sqlContext to the constructor and not to use pipelines
>>> @Override
>>> public DataFrame transform(DataFrame df) {
>>>     String fmt = "%s(%s) as %s";
>>>     String stmt = String.format(fmt, udfName, inputCol, outputCol);
>>>     logger.info("\nstmt: {}", stmt);
>>>     DataFrame ret = df.selectExpr("*", stmt);
>>>     return ret;
>>> }
>>> 
>>> 
>>> 
>>> Is there a way to do something like df.callUDF(myUDF)?
>>> 
>>> 
>>> 
>>> The following Scala code looks like it is close to what I need. I have not
>>> been able to figure out how to do something like this in Java 8; callUDF
>>> does not seem to be available.
>>> 
>>> 
>>> 
>>> spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala
>>> 
>>> @DeveloperApi
>>> 
>>> abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
>>> 
>>>   extends Transformer with HasInputCol with HasOutputCol with Logging {
>>> 
>>> 
>>> 
>>> . . .
>>> 
>>> 
>>> 
>>>  override def transform(dataset: DataFrame): DataFrame = {
>>> 
>>> transformSchema(dataset.schema, logging = true)
>>> 
>>> dataset.withColumn($(outputCol),
>>> 
>>>   callUDF(this.createTransformFunc, outputDataType,
>>> dataset($(inputCol
>>> 
>>>   }
>>> 
>>> 
>>> 
>>> spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>>> 
>>> 
>>> 
>>> class Tokenizer(override val uid: String)
>>> 
>>>   extends UnaryTransformer[String, Seq[String], Tokenizer] with
>>> DefaultParamsWritable {
>>> 
>>> 
>>> 
>>> . . .
>>> 
>>>   override protected def createTransformFunc: String => Seq[String] = {
>>> 
>>> _.toLowerCase.split("\\s")
>>> 
>>>   }
>>> 
>>> . . .
>>> 
>>> }
>>> 
>>> 
>>> 
>>> Kind regards
>>> 
>>> 
>>> 
>>> Andy
>>> 
>>> 
> 




retrieve cell value from a rowMatrix.

2016-01-20 Thread Srivathsan Srinivas
Hi,
   Is there a way to retrieve the cell value of a rowMatrix? Like m(i,j)?
The docs say that the indices are long. Maybe I am doing something
wrong...but, there doesn't seem to be any such direct method.

Any suggestions?
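
For reference, RowMatrix has no direct m(i, j) accessor, because its rows are an
RDD[Vector] without meaningful row indices. One workaround (a sketch only; fine
for an occasional lookup, far too slow for bulk access) is to attach indices and
look a row up:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

def cell(m: RowMatrix, i: Long, j: Int): Double = {
  // zipWithIndex assigns 0-based indices in the RDD's current row order
  val row = m.rows.zipWithIndex().map(_.swap).lookup(i).head
  row(j)
}

For repeated element access it is usually better to build an IndexedRowMatrix
from explicitly indexed rows, or to collect a small matrix to the driver.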

-- 
Thanks,
Srini. 


Getting all field value as Null while reading Hive Table with Partition

2016-01-20 Thread Bijay Pathak
Hello,

I am getting all the field values as NULL while reading a Hive table with
partitions in Spark 1.5.0 running on CDH 5.5.1 with YARN (Dynamic Allocation).

Below is the command I used in Spark_Shell:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val hiveTable = hiveContext.table("test_db.partition_table")

Am I missing something? Do I need to provide other info while reading a
partitioned table?

Thanks,
Bijay


Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Ah, OK! I am a novice to Scala - will take a look at Scala case classes. It
would be awesome if you can provide some pointers.

Thanks,
Raghu

On Wed, Jan 20, 2016 at 12:25 PM, Andy Grove 
wrote:

> I'm talking about implementing CustomerRecord as a scala case class,
> rather than as a Java class. Scala case classes implement the scala.Product
> trait, which Catalyst is looking for.
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 10:21 AM, Raghu Ganti 
> wrote:
>
>> Is it not internal to the Catalyst implementation? I should not be
>> modifying the Spark source to get things to work, do I? :-)
>>
>> On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti 
>> wrote:
>>
>>> Case classes where?
>>>
>>> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
>>> wrote:
>>>
 Honestly, moving to Scala and using case classes is the path of least
 resistance in the long term.



 Thanks,

 Andy.

 --

 Andy Grove
 Chief Architect
 AgilData - Simple Streaming SQL that Scales
 www.agildata.com


 On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
 wrote:

> Thanks for your reply, Andy.
>
> Yes, that is what I concluded based on the Stack trace. The problem is
> stemming from Java implementation of generics, but I thought this will go
> away if you compiled against Java 1.8, which solves the issues of proper
> generic implementation.
>
> Any ideas?
>
> Also, are you saying that in order for my example to work, I would
> need to move to Scala and have the UDT implemented in Scala?
>
>
> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
> wrote:
>
>> Catalyst is expecting a class that implements scala.Row or
>> scala.Product and is instead finding a Java class. I've run into this 
>> issue
>> a number of times. Dataframe doesn't work so well with Java. Here's a 
>> blog
>> post with more information on this:
>>
>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
>> wrote:
>>
>>> Hi,
>>>
>>> I created a custom UserDefinedType in Java as follows:
>>>
>>> SQLPoint = new UserDefinedType() {
>>> //overriding serialize, deserialize, sqlType, userClass functions
>>> here
>>> }
>>>
>>> When creating a dataframe, I am following the manual mapping, I have
>>> a
>>> constructor for JavaPoint - JavaPoint(double x, double y) and a
>>> Customer
>>> record as follows:
>>>
>>> public class CustomerRecord {
>>> private int id;
>>> private String name;
>>> private Object location;
>>>
>>> //setters and getters follow here
>>> }
>>>
>>> Following the example in Spark source, when I create a RDD as
>>> follows:
>>>
>>> sc.textFile(inputFileName).map(new Function>> CustomerRecord>() {
>>> //call method
>>> CustomerRecord rec = new CustomerRecord();
>>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>>> });
>>>
>>> This results in a MatchError. The stack trace is as follows:
>>>
>>> scala.MatchError: [B@45aa3dd5 (of class [B)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>>
>>> 

Re: Parquet write optimization by row group size config

2016-01-20 Thread Pavel Plotnikov
Thanks, Akhil! It helps, but these jobs are still not fast enough; maybe I
missed something.
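
A minimal sketch of the two knobs discussed in this thread (the sizes,
partition count and path below are hypothetical, and it assumes the Parquet
writer picks the row group size up from the Hadoop configuration key
parquet.block.size):

// Scala, Spark 1.5/1.6 style
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024) // 64 MB row groups
df.repartition(48)                  // spread the write across more tasks/cores
  .write
  .parquet("/data/hourly/2016-01-20-18")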

Regards,
Pavel

On Wed, Jan 20, 2016 at 9:51 AM Akhil Das 
wrote:

> Did you try re-partitioning the data before doing the write?
>
> Thanks
> Best Regards
>
> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
> pavel.plotni...@team.wrike.com> wrote:
>
>> Hello,
>> I'm using Spark on some machines in standalone mode; data storage is
>> mounted on these machines via NFS. I have an input data stream, and when
>> I try to store an hour's worth of data in Parquet, the job executes mostly
>> on one core and the hourly data takes 40-50 minutes to store. It is very
>> slow, and it is not an IO problem. After researching how the Parquet format
>> works, I found that writing can be parallelized at the row group
>> abstraction level.
>> I think the row group for my files is too large - how can I change it?
>> When I create a big DataFrame, it divides into parts very well and writes
>> quickly!
>>
>> Thanks,
>> Pavel
>>
>
>


Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Michael Armbrust
The analog to PairRDD is a GroupedDataset (created by calling groupBy),
which offers similar functionality, but doesn't require you to construct
new objects that are in the form of key/value pairs.  It doesn't matter if
they are complex objects, as long as you can create an encoder for them
(currently supported for JavaBeans and case classes, but support for custom
encoders is on the roadmap).  These encoders are responsible for both fast
serialization and providing a view of your object that looks like a row.

Based on the description of your problem, it sounds like you can use
joinWith and just express the predicate as a column.

import org.apache.spark.sql.functions._
ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
school.region"))

The as operation is only required if you need to differentiate columns on
either side that have the same name.

Note that by defining the join condition as an expression instead of a
lambda function, we are giving Spark SQL more information about the join so
it can often do the comparison without needing to deserialize the object,
which over time will let us put more optimizations into the engine.

You can also do this using lambda functions if you want though:

ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region)) { (key, iter1, iter2) =>
  ...
}


On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis  wrote:

> We have been working on a large search problem which we have been solving in
> the following ways.
>
> We have two sets of objects, say children and schools. The objective is to
> find the closest school to each child. There is a distance measure but it
> is relatively expensive and would be very costly to apply to all pairs.
>
> However the map can be divided into regions. If we assume that the closest
> school to a child is in his region or a neighboring region, we need only
> compute the distance between a child and all schools in his region and
> neighboring regions.
>
> We currently use paired RDDs and a join to do this assigning children to
> one region and schools to their own region and neighboring regions and then
> creating a join and computing distances. Note the real problem is more
> complex.
>
> I can create Datasets of the two types of objects but see no Dataset
> analog for a PairRDD. How could I map my solution using PairRDDs to
> Datasets - assume the two objects are relatively complex data types and do
> not look like SQL dataset rows?
>
>
>


Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
Thanks - this helps a lot except for the issue of looking at schools in
neighboring regions
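
One way to cover neighboring regions while still using joinWith (a rough
sketch; the case classes, the neighbors helper and the sqlContext import are
illustrative assumptions, not code from the actual job) is to expand each
school into one row per region it should be visible in, and join on that
expanded key:

import org.apache.spark.sql.functions.expr
import sqlContext.implicits._                 // assumes a SQLContext named sqlContext

case class Child(id: Long, region: Int)
case class School(id: Long, region: Int)

// Hypothetical adjacency: the region itself plus its two neighbors on a line.
def neighbors(region: Int): Seq[Int] = Seq(region - 1, region, region + 1)

val childDS  = Seq(Child(1, 10), Child(2, 11)).toDS()
val schoolDS = Seq(School(100, 9), School(101, 11)).toDS()

// One (candidateRegion, school) row per region the school should be matched in.
val candidates = schoolDS.flatMap(s => neighbors(s.region).map(r => (r, s)))

val pairs = childDS.as("child")
  .joinWith(candidates.as("cand"), expr("child.region = cand._1"))
// pairs: Dataset[(Child, (Int, School))] -- compute the expensive distance per
// pair next, then reduce to the closest school per child.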

On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust 
wrote:

> The analog to PairRDD is a GroupedDataset (created by calling groupBy),
> which offers similar functionality, but doesn't require you to construct
> new objects that are in the form of key/value pairs.  It doesn't matter if
> they are complex objects, as long as you can create an encoder for them
> (currently supported for JavaBeans and case classes, but support for custom
> encoders is on the roadmap).  These encoders are responsible for both fast
> serialization and providing a view of your object that looks like a row.
>
> Based on the description of your problem, it sounds like you can use
> joinWith and just express the predicate as a column.
>
> import org.apache.spark.sql.functions._
> ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
> school.region"))
>
> The as operation is only required if you need to differentiate columns on
> either side that have the same name.
>
> Note that by defining the join condition as an expression instead of a
> lambda function, we are giving Spark SQL more information about the join so
> it can often do the comparison without needing to deserialize the object,
> which over time will let us put more optimizations into the engine.
>
> You can also do this using lambda functions if you want though:
>
> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region)) { (key, iter1, iter2) =>
>   ...
> }
>
>
> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis 
> wrote:
>
>> We have been working on a large search problem which we have been solving in
>> the following ways.
>>
>> We have two sets of objects, say children and schools. The objective is to
>> find the closest school to each child. There is a distance measure but it
>> is relatively expensive and would be very costly to apply to all pairs.
>>
>> However the map can be divided into regions. If we assume that the
>> closest school to a child is in his region or a neighboring region, we need
>> only compute the distance between a child and all schools in his region and
>> neighboring regions.
>>
>> We currently use paired RDDs and a join to do this assigning children to
>> one region and schools to their own region and neighboring regions and then
>> creating a join and computing distances. Note the real problem is more
>> complex.
>>
>> I can create Datasets of the two types of objects but see no Dataset
>> analog for a PairRDD. How could I map my solution using PairRDDs to
>> Datasets - assume the two objects are relatively complex data types and do
>> not look like SQL dataset rows?
>>
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
Catalyst is expecting a class that implements scala.Row or scala.Product
and is instead finding a Java class. I've run into this issue a number of
times. Dataframe doesn't work so well with Java. Here's a blog post with
more information on this:

http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 7:07 AM, raghukiran  wrote:

> Hi,
>
> I created a custom UserDefinedType in Java as follows:
>
> SQLPoint = new UserDefinedType() {
> //overriding serialize, deserialize, sqlType, userClass functions here
> }
>
> When creating a dataframe, I am following the manual mapping, I have a
> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
> record as follows:
>
> public class CustomerRecord {
> private int id;
> private String name;
> private Object location;
>
> //setters and getters follow here
> }
>
> Following the example in Spark source, when I create a RDD as follows:
>
> sc.textFile(inputFileName).map(new Function() {
> //call method
> CustomerRecord rec = new CustomerRecord();
> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
> });
>
> This results in a MatchError. The stack trace is as follows:
>
> scala.MatchError: [B@45aa3dd5 (of class [B)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
>
> 

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
I'm talking about implementing CustomerRecord as a scala case class, rather
than as a Java class. Scala case classes implement the scala.Product trait,
which Catalyst is looking for.
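
A minimal sketch of what that looks like (assuming the usual sc and sqlContext;
the CSV layout is made up, and the point is flattened to two doubles here
rather than going through the UDT):

case class CustomerRecord(id: Int, name: String, x: Double, y: Double)

// Case classes extend scala.Product, so Catalyst can derive the schema directly.
val customers = sc.textFile(inputFileName).map { line =>
  val f = line.split(",")
  CustomerRecord(f(0).toInt, f(1), f(2).toDouble, f(3).toDouble)
}
val customerDF = sqlContext.createDataFrame(customers)
customerDF.printSchema()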


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 10:21 AM, Raghu Ganti  wrote:

> Is it not internal to the Catalyst implementation? I should not be
> modifying the Spark source to get things to work, do I? :-)
>
> On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti 
> wrote:
>
>> Case classes where?
>>
>> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
>> wrote:
>>
>>> Honestly, moving to Scala and using case classes is the path of least
>>> resistance in the long term.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> --
>>>
>>> Andy Grove
>>> Chief Architect
>>> AgilData - Simple Streaming SQL that Scales
>>> www.agildata.com
>>>
>>>
>>> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
>>> wrote:
>>>
 Thanks for your reply, Andy.

 Yes, that is what I concluded based on the Stack trace. The problem is
 stemming from Java implementation of generics, but I thought this will go
 away if you compiled against Java 1.8, which solves the issues of proper
 generic implementation.

 Any ideas?

 Also, are you saying that in order for my example to work, I would need
 to move to Scala and have the UDT implemented in Scala?


 On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
 wrote:

> Catalyst is expecting a class that implements scala.Row or
> scala.Product and is instead finding a Java class. I've run into this 
> issue
> a number of times. Dataframe doesn't work so well with Java. Here's a blog
> post with more information on this:
>
> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
> wrote:
>
>> Hi,
>>
>> I created a custom UserDefinedType in Java as follows:
>>
>> SQLPoint = new UserDefinedType() {
>> //overriding serialize, deserialize, sqlType, userClass functions here
>> }
>>
>> When creating a dataframe, I am following the manual mapping, I have a
>> constructor for JavaPoint - JavaPoint(double x, double y) and a
>> Customer
>> record as follows:
>>
>> public class CustomerRecord {
>> private int id;
>> private String name;
>> private Object location;
>>
>> //setters and getters follow here
>> }
>>
>> Following the example in Spark source, when I create a RDD as follows:
>>
>> sc.textFile(inputFileName).map(new Function()
>> {
>> //call method
>> CustomerRecord rec = new CustomerRecord();
>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>> });
>>
>> This results in a MatchError. The stack trace is as follows:
>>
>> scala.MatchError: [B@45aa3dd5 (of class [B)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>>
>> 

Re: Using Spark, SparkR and Ranger, please help.

2016-01-20 Thread Ted Yu
The tail of the stack trace seems to be chopped off.

Can you include the whole trace ?

Which version of Spark / Hive / Ranger are you using ?

Cheers

On Wed, Jan 20, 2016 at 9:42 AM, Julien Carme 
wrote:

> Hello,
>
> I have been able to use Spark with Apache Ranger. I added the right
> configuration files to the Spark conf and the Ranger jars to the classpath,
> and it works: Spark complies with Ranger rules when I access Hive tables.
>
> However, with SparkR it does not work, which is rather surprising
> considering SparkR is supposed to be just a layer over Spark. I don't
> understand why SparkR seems to behave differently; maybe I am just missing
> something.
>
> So in Spark, when I do:
>
> sqlContext.sql("show databases").collect()
>
> it works, I get all my Hive databases.
>
> But in sparkR it does not behave the same way.
>
> when I do:
>
> sql(sqlContext,"show databases")
>
> ...
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.lang.RuntimeException: [1.1] failure: ``with'' expected but
> identifier show found
> ...
>
> From the documentation it seems that I need to instantiate a HiveContext.
>
> hiveContext <- sparkRHive.init(sc)
> sql(hiveContext, "show databases")
>
> ...
> 16/01/20 18:37:20 ERROR RBackendHandler: sql on 2 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.lang.AssertionError: Authorization plugins not initialized!
> at
> org.apache.hadoop.hive.ql.session.SessionState.getAuthorizationMode(SessionState.java:1511)
> at
> org.apache.hadoop.hive.ql.session.SessionState.isAuthorizationModeV2(SessionState.java:1515)
> at
> org.apache.hadoop.hive.ql.Driver.doAuthorization(Driver.java:566)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> at
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:473)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.appl
> ...
>
> Any help would be appreciated.
>
> Regards,
>
> Julien
>


How to debug join operations on a cluster.

2016-01-20 Thread Borislav Iordanov
Hi, 

I'm reading data from HBase using the latest (2.0.0-SNAPSHOT) Hbase-Spark 
integration module. HBase is deployed on a cluster of 3 machines and Spark is 
deployed as a Standalone cluster on the same machines. 

I am doing a join between two JavaPairRDDs that are constructed from two 
separate HBase tables. An RDD is obtained from an HBase table scan, then it is 
transformed into a pair RDD keyed by the row key from the table. 

When I run my Spark program as a standalone process, either on my development 
machine or on one of the cluster machines, the join returns a correct, 
non-empty result. When I submit the exact same program to the Spark cluster, 
the join comes out empty. In both cases I'm connecting to the Spark master on 
the cluster.  In summary: 

1) mvn exec:java   prints out correct non-empty join 
2) spark-submit --deploy-mode client --class same_main_class --master 
cluster_master_url  prints out empty join 
3) spark-submit --deploy-mode cluster --class same_main_class --master 
cluster_master_url  also prints out empty join 

The spark version deployed is 1.5.1. The same version is declared as a Maven 
dependency. I've also tried with 1.5.2 and 1.6.0, redeploying the cluster etc. 
I've spent a few days trying to troubleshoot this but to no avail. I print out 
a count of the RDDs that I'm joining and it always gives me the correct counts. 
Only the join doesn't work when I submit it as a job to the cluster, regardless 
of where the Spark driver is. 

Can anybody give me some pointers on how to debug this? I assume the RDDs are 
partitioned and shuffled behind the scenes as usual, but something is not 
behaving correctly: there are no exceptions, errors or even warnings, and I 
have no clue why the join would be empty. Again: identical code run as a 
standalone program works, but the same code submitted to the cluster doesn't. 

I'm mainly looking for troubleshooting tips here! 
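
One quick sanity check that often narrows this kind of thing down (a sketch in
Scala; `left` and `right` are stand-ins for the two pair RDDs): count each side
and count how many keys overlap once rendered in a value-comparable form. If
the keys are raw byte arrays, or a key class whose hashCode/equals is
identity-based or not stable across JVMs, a join can look fine in a single
local JVM and come up empty once the data is shuffled across executors, which
is why the check maps the keys to strings first.

// Adjust the key rendering to the real key type.
val leftKeys  = left.keys.map(_.toString)
val rightKeys = right.keys.map(_.toString)
println(s"left=${left.count()} right=${right.count()} " +
  s"overlapping keys=${leftKeys.intersection(rightKeys).count()}")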

Thanks much in advance! 
Boris

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
I would walk through a Spark tutorial in Scala. It will be the best way to
learn this.

In brief though, a Scala case class is like a Java bean / pojo but has a
more concise syntax (no getters/setters).

case class Person(firstName: String, lastName: String, age: Int)


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 10:28 AM, Raghu Ganti  wrote:

> Ah, OK! I am a novice to Scala - will take a look at Scala case classes.
> It would be awesome if you can provide some pointers.
>
> Thanks,
> Raghu
>
> On Wed, Jan 20, 2016 at 12:25 PM, Andy Grove 
> wrote:
>
>> I'm talking about implementing CustomerRecord as a scala case class,
>> rather than as a Java class. Scala case classes implement the scala.Product
>> trait, which Catalyst is looking for.
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 10:21 AM, Raghu Ganti 
>> wrote:
>>
>>> Is it not internal to the Catalyst implementation? I should not be
>>> modifying the Spark source to get things to work, do I? :-)
>>>
>>> On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti 
>>> wrote:
>>>
 Case classes where?

 On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
 wrote:

> Honestly, moving to Scala and using case classes is the path of least
> resistance in the long term.
>
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
> wrote:
>
>> Thanks for your reply, Andy.
>>
>> Yes, that is what I concluded based on the Stack trace. The problem
>> is stemming from Java implementation of generics, but I thought this will
>> go away if you compiled against Java 1.8, which solves the issues of 
>> proper
>> generic implementation.
>>
>> Any ideas?
>>
>> Also, are you saying that in order for my example to work, I would
>> need to move to Scala and have the UDT implemented in Scala?
>>
>>
>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove > > wrote:
>>
>>> Catalyst is expecting a class that implements scala.Row or
>>> scala.Product and is instead finding a Java class. I've run into this 
>>> issue
>>> a number of times. Dataframe doesn't work so well with Java. Here's a 
>>> blog
>>> post with more information on this:
>>>
>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> --
>>>
>>> Andy Grove
>>> Chief Architect
>>> AgilData - Simple Streaming SQL that Scales
>>> www.agildata.com
>>>
>>>
>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
>>> wrote:
>>>
 Hi,

 I created a custom UserDefinedType in Java as follows:

 SQLPoint = new UserDefinedType() {
 //overriding serialize, deserialize, sqlType, userClass functions
 here
 }

 When creating a dataframe, I am following the manual mapping, I
 have a
 constructor for JavaPoint - JavaPoint(double x, double y) and a
 Customer
 record as follows:

 public class CustomerRecord {
 private int id;
 private String name;
 private Object location;

 //setters and getters follow here
 }

 Following the example in Spark source, when I create a RDD as
 follows:

 sc.textFile(inputFileName).map(new Function() {
 //call method
 CustomerRecord rec = new CustomerRecord();
 rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
 });

 This results in a MatchError. The stack trace is as follows:

 scala.MatchError: [B@45aa3dd5 (of class [B)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
 at

 

Using Spark, SparkR and Ranger, please help.

2016-01-20 Thread Julien Carme
Hello,

I have been able to use Spark with Apache Ranger. I added the right
configuration files to the Spark conf and the Ranger jars to the classpath,
and it works: Spark complies with Ranger rules when I access Hive tables.

However, with SparkR it does not work, which is rather surprising
considering SparkR is supposed to be just a layer over Spark. I don't
understand why SparkR seems to behave differently; maybe I am just missing
something.

So in Spark, when I do:

sqlContext.sql("show databases").collect()

it works, I get all my Hive databases.

But in sparkR it does not behave the same way.

when I do:

sql(sqlContext,"show databases")

...
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.RuntimeException: [1.1] failure: ``with'' expected but
identifier show found
...

From the documentation it seems that I need to instantiate a HiveContext.

hiveContext <- sparkRHive.init(sc)
sql(hiveContext, "show databases")

...
16/01/20 18:37:20 ERROR RBackendHandler: sql on 2 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.AssertionError: Authorization plugins not initialized!
at
org.apache.hadoop.hive.ql.session.SessionState.getAuthorizationMode(SessionState.java:1511)
at
org.apache.hadoop.hive.ql.session.SessionState.isAuthorizationModeV2(SessionState.java:1515)
at org.apache.hadoop.hive.ql.Driver.doAuthorization(Driver.java:566)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:473)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.appl
...

Any help would be appreciated.

Regards,

Julien


Re: How to use scala.math.Ordering in java

2016-01-20 Thread Ted Yu
Please take a look at the following files for some examples:

sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java
sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java

Cheers

On Wed, Jan 20, 2016 at 1:03 AM, ddav  wrote:

> Hi,
>
> I am writing my Spark application in java and I need to use a
> RangePartitioner.
>
> JavaPairRDD<String, ProgramDataRef> progRef1 =
> sc.textFile(programReferenceDataFile, 12).filter(
> (String s) ->
> !s.startsWith("#")).mapToPair(
> (String s) -> {
> ProgramDataRef ref = new
> ProgramDataRef(s);
> return new Tuple2<String, ProgramDataRef>(ref.genre, ref);
> }
> );
>
> RangePartitioner rangePart = new
> RangePartitioner(12, progRef1.rdd(), true, ?,
> progRef1.kClassTag());
>
> I can't determine how to create the correct object for parameter 4 which is
> "scala.math.Ordering evidence$1" from the documentation. From the
> scala.math.Ordering code I see there are many implicit objects and one
> handles Strings. How can I access them from Java.
>
> Thanks,
> Dave.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-scala-math-Ordering-in-java-tp26019.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Looking for the best tool that support structured DB and fast text indexing and searching with Spark

2016-01-20 Thread Khaled Al-Gumaei
Hello,

I would like to do some calculations of *Term Frequency* and *Document
Frequency* using spark.
BUT, I need my input to come from a database table (rows and columns) and the
output to go to a database table as well.
Which kind of tool should I use for this purpose (*supporting DB tables* and
*fast text searching and indexing*)?
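
For the counting part, Spark SQL can read from and write to a relational
database directly over JDBC, so that part needs no extra tool. A rough sketch
(assuming a SQLContext named sqlContext; the JDBC URL, table and column names
are all hypothetical):

import java.util.Properties
import org.apache.spark.sql.functions.{col, explode, lower, split}

val props = new Properties()                   // add user/password as needed
val url   = "jdbc:postgresql://dbhost/corpus"  // hypothetical

// Source table: documents(doc_id, text)
val docs  = sqlContext.read.jdbc(url, "documents", props)
val terms = docs.select(col("doc_id"),
  explode(split(lower(col("text")), "\\s+")).as("term"))

// Term frequency per (doc, term) and document frequency per term.
val termFreq = terms.groupBy("doc_id", "term").count().withColumnRenamed("count", "tf")
val docFreq  = terms.distinct().groupBy("term").count().withColumnRenamed("count", "df")

termFreq.write.jdbc(url, "term_frequency", props)
docFreq.write.jdbc(url, "document_frequency", props)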

Thanks in advance.

Best regards,
Khaled



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Looking-for-the-best-tool-that-support-structured-DB-and-fast-text-indexing-and-searching-with-Spark-tp26023.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-20 Thread Shixiong(Ryan) Zhu
Could you share your log?

On Wed, Jan 20, 2016 at 7:55 AM, Brian London 
wrote:

> I'm running a streaming job that has two calls to updateStateByKey.  When
> run in standalone mode both calls to updateStateByKey behave as expected.
> When run on a cluster, however, it appears that the first call is not being
> checkpointed as shown in this DAG image:
>
> http://i.imgur.com/zmQ8O2z.png
>
> The middle column continues to grow one level deeper every batch until I
> get a stack overflow error.  I'm guessing it's a problem of the stateRDD not
> being persisted, but I can't imagine why it wouldn't be.  I thought
> updateStateByKey was supposed to just handle that for you internally.
>
> Any ideas?
>
> I'll post stack trace excerpts of the stack overflow below if anyone is
> interested:
>
> Job aborted due to stage failure: Task 7 in stage 195811.0 failed 4 times,
> most recent failure: Lost task 7.3 in stage 195811.0 (TID 213529,
> ip-10-168-177-216.ec2.internal): java.lang.StackOverflowError at
> java.lang.Exception.(Exception.java:102) at
> java.lang.ReflectiveOperationException.(ReflectiveOperationException.java:89)
> at
> java.lang.reflect.InvocationTargetException.(InvocationTargetException.java:72)
> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606) at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> ...
>
> And
>
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 366
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 366
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 362
> ...
>
>


Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-20 Thread awzurn
Hello,

I'm doing some work on Amazon's EMR cluster, and am noticing some peculiar
results when using both DataFrames to procure and operate on data, and also
when using Spark SQL within Zeppelin to run graphs/reports. Particularly,
I'm noticing that when using either of these on the EMR running Spark 1.5.2,
it will truncate the first 8 characters from a String. You can view a sample
of this in the attached images.

On the left is Spark running locally on my Mac, printing results from a
dataframe on a test set of data. On the right, running the same operations
on the same set of data on EMR.

 

Similar results when running spark sql using the %sql tag in Zeppelin for
graphing.

 

Additionally, when I transform these back to an RDD, results are shown as
wanted (on Amazon EMR).

 

I'm rather certain that this is not the intended behavior, especially
considering the DataFrame prints out the full results when running on my local
machine with the same version of Spark.

Is there a setting somewhere for DataFrames and Spark SQL that could be causing
this issue to come up?

Thanks,

Andrew Zurn

*Specs for EMR*
Release label: emr-4.2.0
Hadoop distribution: Amazon 2.6.0
Applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1, Spark 1.5.2, Ganglia 3.6.0,
Mahout 0.11.0, Oozie-Sandbox 4.2.0, Presto-Sandbox 0.125, Zeppelin-Sandbox 0.5.5

Master: Running 1 c3.4xlarge
Core: Running 10 r3.4xlarge

*Additional Configurations*
spark.executor.cores            5
spark.dynamicAllocation.enabled true
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.executor.memory           34G






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Spark-SQL-Drops-First-8-Characters-of-String-on-Amazon-EMR-tp26022.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




updateStateByKey not persisting in Spark 1.5.1

2016-01-20 Thread Brian London
I'm running a streaming job that has two calls to updateStateByKey.  When
run in standalone mode both calls to updateStateByKey behave as expected.
When run on a cluster, however, it appears that the first call is not being
checkpointed as shown in this DAG image:

http://i.imgur.com/zmQ8O2z.png

The middle column continues to grow one level deeper every batch until I
get a stack overflow error.  I'm guessing it's a problem of the stateRDD not
being persisted, but I can't imagine why it wouldn't be.  I thought
updateStateByKey was supposed to just handle that for you internally.

Any ideas?
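
One thing worth trying is to make the checkpointing of each stateful stream
explicit rather than relying on the defaults. A minimal sketch (stream names,
types and intervals are hypothetical; the checkpoint directory must be visible
to all executors):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///tmp/app-checkpoints")

val counts = lines.map(w => (w, 1L))              // lines: DStream[String], assumed
val state1 = counts.updateStateByKey((vs: Seq[Long], s: Option[Long]) => Some(s.getOrElse(0L) + vs.sum))
state1.checkpoint(Seconds(100))                   // ~10x the batch interval

val state2 = state1.updateStateByKey((vs: Seq[Long], s: Option[Long]) => Some(s.getOrElse(0L) + vs.sum))
state2.checkpoint(Seconds(100))
state2.print()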

I'll post stack trace excerpts of the stack overflow below if anyone is
interested:

Job aborted due to stage failure: Task 7 in stage 195811.0 failed 4 times,
most recent failure: Lost task 7.3 in stage 195811.0 (TID 213529,
ip-10-168-177-216.ec2.internal): java.lang.StackOverflowError at
java.lang.Exception.(Exception.java:102) at
java.lang.ReflectiveOperationException.(ReflectiveOperationException.java:89)
at
java.lang.reflect.InvocationTargetException.(InvocationTargetException.java:72)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897) at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
...

And

scala.collection.immutable.$colon$colon in readObject at line 362
scala.collection.immutable.$colon$colon in readObject at line 366
scala.collection.immutable.$colon$colon in readObject at line 362
scala.collection.immutable.$colon$colon in readObject at line 362
scala.collection.immutable.$colon$colon in readObject at line 366
scala.collection.immutable.$colon$colon in readObject at line 362
scala.collection.immutable.$colon$colon in readObject at line 362
...


Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Case classes where?

On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
wrote:

> Honestly, moving to Scala and using case classes is the path of least
> resistance in the long term.
>
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
> wrote:
>
>> Thanks for your reply, Andy.
>>
>> Yes, that is what I concluded based on the Stack trace. The problem is
>> stemming from Java implementation of generics, but I thought this will go
>> away if you compiled against Java 1.8, which solves the issues of proper
>> generic implementation.
>>
>> Any ideas?
>>
>> Also, are you saying that in order for my example to work, I would need
>> to move to Scala and have the UDT implemented in Scala?
>>
>>
>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
>> wrote:
>>
>>> Catalyst is expecting a class that implements scala.Row or scala.Product
>>> and is instead finding a Java class. I've run into this issue a number of
>>> times. Dataframe doesn't work so well with Java. Here's a blog post with
>>> more information on this:
>>>
>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> --
>>>
>>> Andy Grove
>>> Chief Architect
>>> AgilData - Simple Streaming SQL that Scales
>>> www.agildata.com
>>>
>>>
>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
>>> wrote:
>>>
 Hi,

 I created a custom UserDefinedType in Java as follows:

 SQLPoint = new UserDefinedType() {
 //overriding serialize, deserialize, sqlType, userClass functions here
 }

 When creating a dataframe, I am following the manual mapping, I have a
 constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
 record as follows:

 public class CustomerRecord {
 private int id;
 private String name;
 private Object location;

 //setters and getters follow here
 }

 Following the example in Spark source, when I create a RDD as follows:

 sc.textFile(inputFileName).map(new Function() {
 //call method
 CustomerRecord rec = new CustomerRecord();
 rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
 });

 This results in a MatchError. The stack trace is as follows:

 scala.MatchError: [B@45aa3dd5 (of class [B)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
 at

 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
 at

 org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
 at

 org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at
 scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at

 org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
 at

 org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at

 

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Is it not internal to the Catalyst implementation? I should not be
modifying the Spark source to get things to work, do I? :-)

On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti  wrote:

> Case classes where?
>
> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove 
> wrote:
>
>> Honestly, moving to Scala and using case classes is the path of least
>> resistance in the long term.
>>
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti 
>> wrote:
>>
>>> Thanks for your reply, Andy.
>>>
>>> Yes, that is what I concluded based on the Stack trace. The problem is
>>> stemming from Java implementation of generics, but I thought this will go
>>> away if you compiled against Java 1.8, which solves the issues of proper
>>> generic implementation.
>>>
>>> Any ideas?
>>>
>>> Also, are you saying that in order for my example to work, I would need
>>> to move to Scala and have the UDT implemented in Scala?
>>>
>>>
>>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
>>> wrote:
>>>
 Catalyst is expecting a class that implements scala.Row or
 scala.Product and is instead finding a Java class. I've run into this issue
 a number of times. Dataframe doesn't work so well with Java. Here's a blog
 post with more information on this:

 http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/


 Thanks,

 Andy.

 --

 Andy Grove
 Chief Architect
 AgilData - Simple Streaming SQL that Scales
 www.agildata.com


 On Wed, Jan 20, 2016 at 7:07 AM, raghukiran 
 wrote:

> Hi,
>
> I created a custom UserDefinedType in Java as follows:
>
> SQLPoint = new UserDefinedType() {
> //overriding serialize, deserialize, sqlType, userClass functions here
> }
>
> When creating a dataframe, I am following the manual mapping, I have a
> constructor for JavaPoint - JavaPoint(double x, double y) and a
> Customer
> record as follows:
>
> public class CustomerRecord {
> private int id;
> private String name;
> private Object location;
>
> //setters and getters follow here
> }
>
> Following the example in Spark source, when I create a RDD as follows:
>
> sc.textFile(inputFileName).map(new Function() {
> //call method
> CustomerRecord rec = new CustomerRecord();
> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
> });
>
> This results in a MatchError. The stack trace is as follows:
>
> scala.MatchError: [B@45aa3dd5 (of class [B)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 

launching app using SparkLauncher

2016-01-20 Thread seemanto.barua
Hi,

I have a question on org.apache.spark.launcher.SparkLauncher

How is the JavaSparkContext made available to the main class?
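
SparkLauncher only builds and spawns a spark-submit process; it does not hand a
SparkContext or JavaSparkContext to the launched application. The main class
you point it at creates its own context, exactly as it would when submitted by
hand. A minimal sketch (class names and paths are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.launcher.SparkLauncher

// The launched application builds its own context in main().
object MyApp {                                   // hypothetical main class
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("my-app"))
    val jsc = new JavaSparkContext(sc)           // if the Java API is preferred
    // ... job code ...
    sc.stop()
  }
}

// The launching process only describes how to run it.
object Launcher {
  def main(args: Array[String]): Unit = {
    val app = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")     // hypothetical path
      .setMainClass("com.example.MyApp")
      .setMaster("yarn-cluster")
      .launch()                                  // returns a java.lang.Process
    app.waitFor()
  }
}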


-regards
Seemanto Barua







Re: Scala MatchError in Spark SQL

2016-01-20 Thread Raghu Ganti
Thanks for your reply, Andy.

Yes, that is what I concluded based on the stack trace. The problem is
stemming from the Java implementation of generics, but I thought this would go
away if you compiled against Java 1.8, which solves the issues of proper
generics implementation.

Any ideas?

Also, are you saying that in order for my example to work, I would need to
move to Scala and have the UDT implemented in Scala?


On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
wrote:

> Catalyst is expecting a class that implements scala.Row or scala.Product
> and is instead finding a Java class. I've run into this issue a number of
> times. Dataframe doesn't work so well with Java. Here's a blog post with
> more information on this:
>
> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> AgilData - Simple Streaming SQL that Scales
> www.agildata.com
>
>
> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran  wrote:
>
>> Hi,
>>
>> I created a custom UserDefinedType in Java as follows:
>>
>> SQLPoint = new UserDefinedType() {
>> //overriding serialize, deserialize, sqlType, userClass functions here
>> }
>>
>> When creating a dataframe, I am following the manual mapping, I have a
>> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
>> record as follows:
>>
>> public class CustomerRecord {
>> private int id;
>> private String name;
>> private Object location;
>>
>> //setters and getters follow here
>> }
>>
>> Following the example in Spark source, when I create a RDD as follows:
>>
>> sc.textFile(inputFileName).map(new Function() {
>> //call method
>> CustomerRecord rec = new CustomerRecord();
>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>> });
>>
>> This results in a MatchError. The stack trace is as follows:
>>
>> scala.MatchError: [B@45aa3dd5 (of class [B)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>> at
>>
>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
>> at
>>
>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> at scala.collection.TraversableOnce$class.to
>> (TraversableOnce.scala:273)
>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>> at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>> at
>>
>> 

Re: using spark context in map funciton TASk not serilizable error

2016-01-20 Thread Giri P
method1 looks like this

reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)

reRDD has userId's

def method1(sc: SparkContext, userId: String) {
sc.cassandraTable("Keyspace", "Table2").where("userid = ?", userId)
...do something

return "Test"
}
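
For reference, a rough sketch of the join-based rewrite Shixiong suggests below,
so that no SparkContext is captured inside a closure (it reuses reRDD, sc and
outputDir from this thread; the userid column name is an assumption):

import com.datastax.spark.connector._

// Load the Cassandra table once as an RDD, key both sides by userid, and join
// instead of querying Cassandra from inside map().
val byUser = sc.cassandraTable("Keyspace", "Table2")
               .map(row => (row.getString("userid"), row))
val results = reRDD.map(userId => (userId, ()))
                   .join(byUser)
                   .map { case (userId, (_, row)) => s"$userId -> $row" } // "do something" goes here
results.saveAsTextFile(outputDir)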

On Wed, Jan 20, 2016 at 11:00 AM, Shixiong(Ryan) Zhu <
shixi...@databricks.com> wrote:

> You should not use SparkContext or RDD directly in your closures.
>
> Could you show the codes of "method1"? Maybe you only needs join or
> something else. E.g.,
>
> val cassandraRDD = sc.cassandraTable("keySpace", "tableName")
> reRDD.join(cassandraRDD).map().saveAsTextFile(outputDir)
>
>
> On Tue, Jan 19, 2016 at 4:12 AM, Ricardo Paiva <
> ricardo.pa...@corp.globo.com> wrote:
>
>> Did you try SparkContext.getOrCreate() ?
>>
>> You don't need to pass the sparkContext to the map function, you can
>> retrieve it from the SparkContext singleton.
>>
>> Regards,
>>
>> Ricardo
>>
>>
>> On Mon, Jan 18, 2016 at 6:29 PM, gpatcham [via Apache Spark User List] 
>> <[hidden
>> email] > wrote:
>>
>>> Hi,
>>>
>>> I have a use case where I need to pass sparkcontext in map function
>>>
>>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>>
>>> Method1 needs spark context to query cassandra. But I see below error
>>>
>>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>>
>>> Is there a way we can fix this ?
>>>
>>> Thanks
>>>
>>> --
>>> If you reply to this email, your message will be added to the discussion
>>> below:
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>>> To start a new topic under Apache Spark User List, email [hidden email]
>>> 
>>> To unsubscribe from Apache Spark User List, click here.
>>> NAML
>>> 
>>>
>>
>>
>>
>> --
>> Ricardo Paiva
>> Big Data
>> *globo.com* 
>>
>> --
>> View this message in context: Re: using spark context in map funciton
>> TASk not serilizable error
>> 
>>
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>
>


Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Michael Armbrust
Yeah, that's tough.  Perhaps you could do something like a flatMap and emit
multiple virtual copies of each student for each region that neighbors
their actual region.
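
A rough sketch of that idea (assuming children: Dataset[Child] and schools:
Dataset[School] built from case classes, with sqlContext.implicits._ in scope;
neighbours() is a placeholder for the real region geometry):

import org.apache.spark.sql.functions.expr

case class Child(id: Long, region: Int)
case class School(id: Long, region: Int)

// Placeholder: the region itself plus its neighbouring regions.
def neighbours(region: Int): Seq[Int] = Seq(region - 1, region, region + 1)

// Emit one copy of each child per candidate region, then a plain equi-join on
// region pairs every child only with the schools it needs to be compared to.
val childCopies = children.flatMap(c => neighbours(c.region).map(r => c.copy(region = r)))
val candidates = childCopies.as("child")
  .joinWith(schools.as("school"), expr("child.region = school.region"))
// candidates: Dataset[(Child, School)] -- apply the expensive distance measure here.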

On Wed, Jan 20, 2016 at 10:50 AM, Steve Lewis  wrote:

> Thanks - this helps a lot except for the issue of looking at schools in
> neighboring regions
>
> On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust  > wrote:
>
>> The analog to PairRDD is a GroupedDataset (created by calling groupBy),
>> which offers similar functionality, but doesn't require you to construct
>> new object that are in the form of key/value pairs.  It doesn't matter if
>> they are complex objects, as long as you can create an encoder for them
>> (currently supported for JavaBeans and case classes, but support for custom
>> encoders is on the roadmap).  These encoders are responsible for both fast
>> serialization and providing a view of your object that looks like a row.
>>
>> Based on the description of your problem, it sounds like you can use
>> joinWith and just express the predicate as a column.
>>
>> import org.apache.spark.sql.functions._
>> ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
>> school.region"))
>>
>> The as operation is only required if you need to differentiate columns on
>> either side that have the same name.
>>
>> Note that by defining the join condition as an expression instead of a
>> lambda function, we are giving Spark SQL more information about the join so
>> it can often do the comparison without needing to deserialize the object,
>> which overtime will let us put more optimizations into the engine.
>>
>> You can also do this using lambda functions if you want though:
>>
>> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2)
>> =>
>>   ...
>> }
>>
>>
>> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis 
>> wrote:
>>
>>> We have been working a large search problem which we have been solving
>>> in the following ways.
>>>
>>> We have two sets of objects, say children and schools. The object is to
>>> find the closest school to each child. There is a distance measure but it
>>> is relatively expensive and would be very costly to apply to all pairs.
>>>
>>> However the map can be divided into regions. If we assume that the
>>> closest school to a child is in his region of a neighboring region we need
>>> only compute the distance between a child and all schools in his region and
>>> neighboring regions.
>>>
>>> We currently use paired RDDs and a join to do this assigning children to
>>> one region and schools to their own region and neighboring regions and then
>>> creating a join and computing distances. Note the real problem is more
>>> complex.
>>>
>>> I can create Datasets of the two types of objects but see no Dataset
>>> analog for a PairRDD. How could I map my solution using PairRDDs to
>>> Datasets - assume the two objects are relatively complex data types and do
>>> not look like SQL dataset rows?
>>>
>>>
>>>
>>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>


Re: Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-20 Thread Shixiong(Ryan) Zhu
Could you share your log?

On Wed, Jan 20, 2016 at 5:37 AM, Siddharth Ubale <
siddharth.ub...@syncoms.com> wrote:

>
>
> Hi,
>
>
>
> I am running a Spark Job on the yarn cluster.
>
> The spark job is a spark streaming application which is reading JSON from
> a kafka topic , inserting the JSON values to hbase tables via Phoenix ,
> ands then sending out certain messages to a websocket if the JSON satisfies
> a certain criteria.
>
>
>
> My cluster is a 3 node cluster with 24GB ram and 24 cores in total.
>
>
>
> Now :
>
> 1. when I am submitting the job with 10GB memory, the application fails
> saying memory is insufficient to run the job
>
> 2. The job is submitted with 6G ram. However, it does not run successfully
> always.Common issues faced :
>
> a. Container exited with a non-zero exit code 1 , and
> after multiple such warning the job is finished.
>
> d. The failed job notifies that it was unable to find a
> file in HDFS which is something like _*hadoop_conf*_xx.zip
>
>
>
> Can someone pls let me know why am I seeing the above 2 issues.
>
>
>
> Thanks,
>
> Siddharth Ubale,
>
>
>


Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-20 Thread Ted Yu
This is related:

SPARK-6847

FYI
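
For anyone hitting this, a minimal sketch (directory, intervals and update
functions are placeholders) of the usual mitigation while that issue is open:
checkpoint each state stream explicitly so its lineage is cut every few batches.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints")   // required for updateStateByKey

// events: DStream[(K, V)]; updateFunc1 / updateFunc2: (Seq[V], Option[S]) => Option[S]
val state1 = events.updateStateByKey(updateFunc1)
state1.checkpoint(Seconds(50))              // explicit interval, a multiple of the batch interval

val state2 = state1.updateStateByKey(updateFunc2)
state2.checkpoint(Seconds(50))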

On Wed, Jan 20, 2016 at 7:55 AM, Brian London 
wrote:

> I'm running a streaming job that has two calls to updateStateByKey.  When
> run in standalone mode both calls to updateStateByKey behave as expected.
> When run on a cluster, however, it appears that the first call is not being
> checkpointed as shown in this DAG image:
>
> http://i.imgur.com/zmQ8O2z.png
>
> The middle column continues to grow one level deeper every batch until I
> get a stack overflow error.  I'm guessing its a problem of the stateRDD not
> being persisted, but I can't imagine why they wouldn't be.  I thought
> updateStateByKey was supposed to just handle that for you internally.
>
> Any ideas?
>
> I'll post stack trace excperpts of the stack overflow if anyone is
> interested below:
>
> Job aborted due to stage failure: Task 7 in stage 195811.0 failed 4 times,
> most recent failure: Lost task 7.3 in stage 195811.0 (TID 213529,
> ip-10-168-177-216.ec2.internal): java.lang.StackOverflowError at
> java.lang.Exception.(Exception.java:102) at
> java.lang.ReflectiveOperationException.(ReflectiveOperationException.java:89)
> at
> java.lang.reflect.InvocationTargetException.(InvocationTargetException.java:72)
> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606) at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> ...
>
> And
>
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 366
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 366
> scala.collection.immutable.$colon$colon in readObject at line 362
> scala.collection.immutable.$colon$colon in readObject at line 362
> ...
>
>


Re: using spark context in map funciton TASk not serilizable error

2016-01-20 Thread Shixiong(Ryan) Zhu
You should not use SparkContext or RDD directly in your closures.

Could you show the codes of "method1"? Maybe you only needs join or
something else. E.g.,

val cassandraRDD = sc.cassandraTable("keySpace", "tableName")
reRDD.join(cassandraRDD).map().saveAsTextFile(outputDir)


On Tue, Jan 19, 2016 at 4:12 AM, Ricardo Paiva  wrote:

> Did you try SparkContext.getOrCreate() ?
>
> You don't need to pass the sparkContext to the map function, you can
> retrieve it from the SparkContext singleton.
>
> Regards,
>
> Ricardo
>
>
> On Mon, Jan 18, 2016 at 6:29 PM, gpatcham [via Apache Spark User List] 
> <[hidden
> email] > wrote:
>
>> Hi,
>>
>> I have a use case where I need to pass sparkcontext in map function
>>
>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>
>> Method1 needs spark context to query cassandra. But I see below error
>>
>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>
>> Is there a way we can fix this ?
>>
>> Thanks
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>> To start a new topic under Apache Spark User List, email [hidden email]
>> 
>> To unsubscribe from Apache Spark User List, click here.
>> NAML
>> 
>>
>
>
>
> --
> Ricardo Paiva
> Big Data
> *globo.com* 
>
> --
> View this message in context: Re: using spark context in map funciton
> TASk not serilizable error
> 
>
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


visualize data from spark streaming

2016-01-20 Thread patcharee

Hi,

How to visualize realtime data (in graph/chart) from spark streaming? 
Any tools?


Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: visualize data from spark streaming

2016-01-20 Thread Vinay Shukla
Or you can use Zeppelin notebook to visualize Spark Streaming. See
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2hvcnRvbndvcmtzLWdhbGxlcnkvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci8yQjUyMlYzWDgvbm90ZS5qc29u

and other examples https://github.com/hortonworks-gallery/zeppelin-notebooks

On Wed, Jan 20, 2016 at 12:35 PM, Darren Govoni  wrote:

> Gotta roll your own. Look at kafka and websockets for example.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: patcharee 
> Date: 01/20/2016 2:54 PM (GMT-05:00)
> To: user@spark.apache.org
> Subject: visualize data from spark streaming
>
> Hi,
>
> How to visualize realtime data (in graph/chart) from spark streaming?
> Any tools?
>
> Best,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Cache table as

2016-01-20 Thread Younes Naguib
Hi all,

I'm connected to the Thrift server using beeline on Spark 1.6.
I used: cache table tbl as select * from table1
I see table1 in the storage memory. I can use it. But when I reconnect, I can't
query it anymore.
I get: Error: org.apache.spark.sql.AnalysisException: Table not found: table1;
line 1 pos 43 (state=,code=0)

However, the table is still in the cache storage.

Any idea?


Younes



Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
Honestly, moving to Scala and using case classes is the path of least
resistance in the long term.



Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com
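
A minimal Scala sketch of that route (all names and the input format are
illustrative): with a case class, Catalyst derives the schema directly and no
custom UDT or bean mapping is needed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class CustomerRecord(id: Int, name: String, x: Double, y: Double)

object CaseClassExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("case-class-example"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse a simple comma-separated input into case-class instances and let
    // toDF() infer the schema from the case class.
    val df = sc.textFile("customers.csv")
      .map(_.split(","))
      .map(f => CustomerRecord(f(0).toInt, f(1), f(2).toDouble, f(3).toDouble))
      .toDF()

    df.printSchema()
    sc.stop()
  }
}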


On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti  wrote:

> Thanks for your reply, Andy.
>
> Yes, that is what I concluded based on the Stack trace. The problem is
> stemming from Java implementation of generics, but I thought this will go
> away if you compiled against Java 1.8, which solves the issues of proper
> generic implementation.
>
> Any ideas?
>
> Also, are you saying that in order for my example to work, I would need to
> move to Scala and have the UDT implemented in Scala?
>
>
> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove 
> wrote:
>
>> Catalyst is expecting a class that implements scala.Row or scala.Product
>> and is instead finding a Java class. I've run into this issue a number of
>> times. Dataframe doesn't work so well with Java. Here's a blog post with
>> more information on this:
>>
>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran  wrote:
>>
>>> Hi,
>>>
>>> I created a custom UserDefinedType in Java as follows:
>>>
>>> SQLPoint = new UserDefinedType() {
>>> //overriding serialize, deserialize, sqlType, userClass functions here
>>> }
>>>
>>> When creating a dataframe, I am following the manual mapping, I have a
>>> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
>>> record as follows:
>>>
>>> public class CustomerRecord {
>>> private int id;
>>> private String name;
>>> private Object location;
>>>
>>> //setters and getters follow here
>>> }
>>>
>>> Following the example in Spark source, when I create a RDD as follows:
>>>
>>> sc.textFile(inputFileName).map(new Function() {
>>> //call method
>>> CustomerRecord rec = new CustomerRecord();
>>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>>> });
>>>
>>> This results in a MatchError. The stack trace is as follows:
>>>
>>> scala.MatchError: [B@45aa3dd5 (of class [B)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>>
>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at
>>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at
>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at
>>> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>> at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>> at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>> at scala.collection.TraversableOnce$class.to
>>> (TraversableOnce.scala:273)
>>> at 

I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
We have been working a large search problem which we have been solving in
the following ways.

We have two sets of objects, say children and schools. The object is to
find the closest school to each child. There is a distance measure but it
is relatively expensive and would be very costly to apply to all pairs.

However the map can be divided into regions. If we assume that the closest
school to a child is in his region of a neighboring region we need only
compute the distance between a child and all schools in his region and
neighboring regions.

We currently use paired RDDs and a join to do this assigning children to
one region and schools to their own region and neighboring regions and then
creating a join and computing distances. Note the real problem is more
complex.

I can create Datasets of the two types of objects but see no Dataset analog
for a PairRDD. How could I map my solution using PairRDDs to Datasets -
assume the two objects are relatively complex data types and do not look
like SQL dataset rows?


Re: trouble implementing complex transformer in java that can be used with Pipeline. Scala to Java porting problem

2016-01-20 Thread Andy Davidson
For clarity, callUDF() is not defined on DataFrames. It is defined on
org.apache.spark.sql.functions. Strangely, the class name starts with lower
case. I have not figured out how to use the functions class.

http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.
html

Andy

From:  Andrew Davidson 
Date:  Wednesday, January 20, 2016 at 4:05 PM
To:  "user @spark" 
Subject:  trouble implementing  complex transformer in java that can be used
with Pipeline. Scala to Java porting problem

> I am using 1.6.0. I am having trouble implementing a custom transformer
> derived from org.apache.spark.ml.Transformer in Java that I can use in a
> PipeLine.
> 
> So far the only way I figure out how to implement any kind of complex
> functionality and have it applied to a DataFrame is to implement a UDF. For
> example
> 
> 
>class StemmerUDF implements UDF1, Serializable {
> 
> private static final long serialVersionUID = 1L;
> 
> 
> 
> @Override
> 
> public List call(String text) throws Exception {
> 
> List ret = stemText(text); //call org.apache.lucene
> 
> return ret;
> 
> }
> 
> }
> 
> 
> 
> Before I can use the UDF it needs to be registered. This requires the
> sqlContext. The problem is sqlContext is not available during pipeline.load()
> 
> 
>void registerUDF(SQLContext sqlContext) {
> 
> if (udf == null) {
> 
> udf = new StemmerUDF();
> 
> DataType returnType =
> DataTypes.createArrayType(DataTypes.StringType);
> 
> sqlContext.udf().register(udfName, udf, returnType);
> 
> }
> 
> }
> 
> 
> Our transformer needs to implement transform(). For it to be able to use the
> registered UDF we need the sqlContext. The problem is the sqlContext is not
> part of the signature of transform. My current hack is to pass the sqlContext
> to the constructor and not to use pipelines
>   @Override
> 
> public DataFrame transform(DataFrame df) {
> 
>   String fmt = "%s(%s) as %s";
> 
> String stmt = String.format(fmt, udfName, inputCol, outputCol);
> 
> logger.info("\nstmt: {}", stmt);
> 
> DataFrame ret = df.selectExpr("*", stmt);
> 
> return ret;
> 
> }
> 
> 
> 
> Is they a way to do something like df.callUDF(myUDF);
> 
> 
> 
> The following Scala code looks like it is close to what I need. I not been
> able to figure out how do something like this in Java 8. callUDF does not seem
> to be avaliable.
> 
> 
> 
> spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala
> 
> @DeveloperApi
> 
> abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
> 
>   extends Transformer with HasInputCol with HasOutputCol with Logging {
> 
> 
> 
> . . .
> 
> 
> 
>  override def transform(dataset: DataFrame): DataFrame = {
> 
> transformSchema(dataset.schema, logging = true)
> 
> dataset.withColumn($(outputCol),
> 
>   callUDF(this.createTransformFunc, outputDataType, dataset($(inputCol
> 
>   }
> 
> 
> 
> spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.s
> cala 
> 
> 
> 
> class Tokenizer(override val uid: String)
> 
>   extends UnaryTransformer[String, Seq[String], Tokenizer] with
> DefaultParamsWritable {
> 
> 
> 
> . . .
> 
>   override protected def createTransformFunc: String => Seq[String] = {
> 
> _.toLowerCase.split("\\s")
> 
>   }
> 
> . . .
> 
> }
> 
> 
> 
> Kind regards
> 
> 
> 
> Andy
> 
> 




trouble implementing complex transformer in java that can be used with Pipeline. Scala to Java porting problem

2016-01-20 Thread Andy Davidson
I am using 1.6.0. I am having trouble implementing a custom transformer
derived from org.apache.spark.ml.Transformer in Java that I can use in a
Pipeline.

So far the only way I have figured out how to implement any kind of complex
functionality and have it applied to a DataFrame is to implement a UDF. For
example


   class StemmerUDF implements UDF1<String, List<String>>, Serializable {

private static final long serialVersionUID = 1L;



@Override

public List<String> call(String text) throws Exception {

List<String> ret = stemText(text); // call org.apache.lucene

return ret;

}

}



Before I can use the UDF it needs to be registered. This requires the
sqlContext. The problem is sqlContext is not available during
pipeline.load() 


   void registerUDF(SQLContext sqlContext) {

if (udf == null) {

udf = new StemmerUDF();

DataType returnType =
DataTypes.createArrayType(DataTypes.StringType);

sqlContext.udf().register(udfName, udf, returnType);

}

}


Our transformer needs to implement transform(). For it to be able to use the
registered UDF we need the sqlContext. The problem is the sqlContext is not
part of the signature of transform. My current hack is to pass the
sqlContext to the constructor and not to use pipelines
  @Override

public DataFrame transform(DataFrame df) {

  String fmt = "%s(%s) as %s";

String stmt = String.format(fmt, udfName, inputCol, outputCol);

logger.info("\nstmt: {}", stmt);

DataFrame ret = df.selectExpr("*", stmt);

return ret;

}



Is there a way to do something like df.callUDF(myUDF)?



The following Scala code looks like it is close to what I need. I have not been
able to figure out how to do something like this in Java 8. callUDF does not
seem to be available.



spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala

@DeveloperApi

abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]

  extends Transformer with HasInputCol with HasOutputCol with Logging {



. . .



 override def transform(dataset: DataFrame): DataFrame = {

transformSchema(dataset.schema, logging = true)

dataset.withColumn($(outputCol),

  callUDF(this.createTransformFunc, outputDataType,
dataset($(inputCol

  }



spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer
.scala 



class Tokenizer(override val uid: String)

  extends UnaryTransformer[String, Seq[String], Tokenizer] with
DefaultParamsWritable {



. . .

  override protected def createTransformFunc: String => Seq[String] = {

_.toLowerCase.split("\\s")

  }

. . .

}



Kind regards



Andy






Re: trouble implementing complex transformer in java that can be used with Pipeline. Scala to Java porting problem

2016-01-20 Thread Kevin Mellott
Hi Andy,

According to the API documentation for DataFrame
,
you should have access to *sqlContext* as a property off of the DataFrame
instance. In your example, you could then do something like:

df.sqlContext.udf.register(...)

Thanks,
Kevin
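
A rough Scala sketch of the approach Kevin describes (the stemming logic, column
names and schema handling are simplified placeholders): obtain the SQLContext
from the DataFrame inside transform(), so nothing has to be passed to the
constructor and the transformer can still be loaded as part of a Pipeline.

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

class StemmerTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("stemmer"))

  val inputCol = "text"     // placeholders; a real transformer would expose Params
  val outputCol = "stemmed"

  // Stand-in for the Lucene-based stemming in this thread.
  private def stemText(text: String): Seq[String] = text.toLowerCase.split("\\s").toSeq

  override def transform(df: DataFrame): DataFrame = {
    val udfName = uid + "_stem"
    df.sqlContext.udf.register(udfName, stemText _)  // the SQLContext comes from the DataFrame itself
    df.selectExpr("*", s"$udfName($inputCol) as $outputCol")
  }

  // Simplified: a real implementation would append the new output field.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = this
}

The same lookup is available from Java as df.sqlContext().udf().register(...).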

On Wed, Jan 20, 2016 at 6:15 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> For clarity callUDF() is not defined on DataFrames. It is defined on 
> org.apache.spark.sql.functions
> . Strange the class name starts with lower case. I have not figure out
> how to use function class.
>
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
>
> Andy
>
> From: Andrew Davidson 
> Date: Wednesday, January 20, 2016 at 4:05 PM
> To: "user @spark" 
> Subject: trouble implementing complex transformer in java that can be
> used with Pipeline. Scala to Java porting problem
>
> I am using 1.6.0. I am having trouble implementing a custom transformer
> derived from org.apache.spark.ml.Transformer in Java that I can use in
> a PipeLine.
>
> So far the only way I figure out how to implement any kind of complex
> functionality and have it applied to a DataFrame is to implement a UDF. For
> example
>
>
>class StemmerUDF implements UDF1, Serializable {
>
> private static final long serialVersionUID = 1L;
>
>
> @Override
>
> public List call(String text) throws Exception {
>
> List ret = stemText(text); //call org.apache.lucene
>
> return ret;
>
> }
>
> }
>
>
> Before I can use the UDF it needs to be registered. This requires the
> sqlContext. *The problem is sqlContext is not available during
> pipeline.load()*
>
>void registerUDF(SQLContext sqlContext) {
>
> if (udf == null) {
>
> udf = new StemmerUDF();
>
> DataType returnType = DataTypes.createArrayType(DataTypes.
> StringType);
>
> sqlContext.udf().register(udfName, udf, returnType);
>
> }
>
> }
>
>
> Our transformer needs to implement transform(). For it to be able to use
> the registered UDF we need the sqlContext. *The problem is the sqlContext
> is not part of the signature of transform.* My current hack is to pass
> the sqlContext to the constructor and not to use pipelines
>
>   @Override
>
> public DataFrame transform(DataFrame df) {
>
>   String fmt = "%s(%s) as %s";
>
> String stmt = String.format(fmt, udfName, inputCol, outputCol);
>
> logger.info("\nstmt: {}", stmt);
>
> DataFrame ret = df.selectExpr("*", stmt);
>
> return ret;
>
> }
>
>
> Is they a way to do something like df.callUDF(myUDF);
>
>
> *The following Scala code looks like it is close to what I need. I not
> been able to figure out how do something like this in Java 8. callUDF does
> not seem to be avaliable.*
>
>
>
> spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala
>
> @DeveloperApi
>
> abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT,
> T]]
>
>   extends Transformer with HasInputCol with HasOutputCol with Logging {
>
> . . .
>
>
>  override def transform(dataset: DataFrame): DataFrame = {
>
> transformSchema(dataset.schema, logging = true)
>
> dataset.withColumn($(outputCol),
>
>   callUDF(this.createTransformFunc, outputDataType, dataset($(inputCol
> 
>
>   }
>
>
>
> spark/spark-1.6.0/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>
>
> class Tokenizer(override val uid: String)
>
>   extends UnaryTransformer[String, Seq[String], Tokenizer] with
> DefaultParamsWritable {
>
>
> . . .
>
>   override protected def createTransformFunc: String => Seq[String] = {
>
> _.toLowerCase.split("\\s")
>
>   }
>
> . . .
>
> }
>
>
> Kind regards
>
>
> Andy
>
>
>


Re: Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-20 Thread fightf...@163.com
OK. I am trying to use the jdbc read datasource with predicates like the
following:
   sqlContext.read.jdbc(url, table, Array("foo = 1", "foo = 3"), props)
I can see that the task goes to 62 partitions. But I still get an exception and
the parquet file did not write successfully. Do I need to increase the number of
partitions? Or are there any other alternatives I can choose to tune this?

Best,
Sun.



fightf...@163.com
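
For reference, a sketch of the explicit partitioning overload (connection
details copied from the thread, bounds and partition count illustrative). As far
as I can tell, in Spark 1.x the partitionColumn/lowerBound/upperBound keys set
only inside the java.util.Properties object are passed through to the driver as
plain connection properties, so it is this overload (or the equivalent
.format("jdbc").options(...) form) that actually produces a partitioned read:

val prop = new java.util.Properties
prop.setProperty("user", "test")
prop.setProperty("password", "test")

// One JDBC connection per partition; added_year should be indexed in MySQL, and
// the partition count tuned so each slice finishes well within wait_timeout.
val jdbcDF1 = sqlContext.read.jdbc(
  "jdbc:mysql://172.16.54.136:3306/db1", "video3",
  "added_year",  // partition column
  1985L, 2015L,  // lower / upper bound of that column's values
  200,           // number of partitions
  prop)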
 
From: fightf...@163.com
Date: 2016-01-20 15:06
To: 刘虓
CC: user
Subject: Re: Re: spark dataframe jdbc read/write using dbcp connection pool
Hi,
Thanks a lot for your suggestion. I then tried the following code :
val prop = new java.util.Properties
prop.setProperty("user","test")
prop.setProperty("password", "test")
prop.setProperty("partitionColumn", "added_year")
prop.setProperty("lowerBound", "1985")
prop.setProperty("upperBound","2015")
prop.setProperty("numPartitions", "200")

val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
val jdbcDF1 = sqlContext.read.jdbc(url1,"video3",prop)
val jdbcDF2 = sqlContext.read.jdbc(url2,"video3",prop)

val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf4")

The added_year column in mysql table contains range of (1985-2015), and I pass 
the numPartitions property 
to get the partition purpose. Is this what you recommend ? Can you advice a 
little more implementation on this ? 

Best,
Sun.



fightf...@163.com
 
From: 刘虓
Date: 2016-01-20 11:26
To: fightf...@163.com
CC: user
Subject: Re: spark dataframe jdbc read/write using dbcp connection pool
Hi,
I suggest you partition the JDBC reading on a indexed column of the mysql table

2016-01-20 10:11 GMT+08:00 fightf...@163.com :
Hi , 
I want to load really large volumn datasets from mysql using spark dataframe 
api. And then save as 
parquet file or orc file to facilitate that with hive / Impala. The datasets 
size is about 1 billion records and 
when I am using the following naive code to run that , Error occurs and 
executor lost failure.

val prop = new java.util.Properties
prop.setProperty("user","test")
prop.setProperty("password", "test")

val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
val jdbcDF1 = sqlContext.read.jdbc(url1,"video",prop)
val jdbcDF2 = sqlContext.read.jdbc(url2,"video",prop)

val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf")

I can see from the executor log and the message is like the following. I can 
see from the log that the wait_timeout threshold reached 
and there is no retry mechanism in the code process. So I am asking you experts 
to help on tuning this. Or should I try to use a jdbc
connection pool to increase parallelism ? 

   16/01/19 17:04:28 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
failure

The last packet successfully received from the server was 377,769 milliseconds 
ago.  The last packet sent successfully to the server was 377,790 milliseconds 
ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Caused by: java.io.EOFException: Can not read response from server. Expected to 
read 4 bytes, read 1 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2914)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1996)
... 22 more
16/01/19 17:10:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
4
16/01/19 17:10:47 INFO jdbc.JDBCRDD: closed connection
16/01/19 17:10:47 ERROR executor.Executor: Exception in task 1.1 in stage 0.0 
(TID 2)
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
failure





fightf...@163.com



Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-20 Thread Paul Leclercq
Hi Raghvendra and Spark users,

I also have trouble activating my standby master when my first master is
shut down (via ./sbin/stop-master.sh or via an instance shutdown), and I just
want to share my thoughts with you.

To answer your question Raghvendra, in *spark-env.sh*, if 2 IPs are set
for SPARK_MASTER_IP(SPARK_MASTER_IP='W.X.Y.Z,A.B.C.D'), the standalone
cluster cannot be launched.

So I use only one IP there, as the Spark context can learn about the other
masters another way; as written in the Standalone ZooKeeper HA
doc, "you might start your SparkContext pointing to
spark://host1:port1,host2:port2"

In my opinion, we should not have to set a SPARK_MASTER_IP as this is
stored in ZooKeeper :

you can launch multiple Masters in your cluster connected to the same
> ZooKeeper instance. One will be elected “leader” and the others will remain
> in standby mode.

When starting up, an application or Worker needs to be able to find and
> register with the current lead Master. Once it successfully registers,
> though, it is “in the system” (i.e., stored in ZooKeeper).

 -
http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper

As I understand it, after a ./sbin/stop-master.sh on both masters, one master
will be elected leader and the other will be on standby.
To launch the workers, we can use ./sbin/start-slave.sh
spark://MASTER_ELECTED_IP:7077
I don't think we can use ./sbin/start-all.sh, which uses the slaves file
to launch workers and masters, as we cannot set 2 master IPs inside
spark-env.sh

My SPARK_DAEMON_JAVA_OPTS content :

SPARK_DAEMON_JAVA_OPTS='-Dspark.deploy.recoveryMode="ZOOKEEPER"
> -Dspark.deploy.zookeeper.url="ZOOKEEPER_IP:2181"
> -Dspark.deploy.zookeeper.dir="/spark"'


A good thing to check if everything went OK is the folder /spark on the
ZooKeeper server. I could not find it on my server.

Thanks for reading,

Paul


2016-01-19 22:12 GMT+01:00 Raghvendra Singh :

> Hi, there is one question. In spark-env.sh should i specify all masters
> for parameter SPARK_MASTER_IP. I've set SPARK_DAEMON_JAVA_OPTS already
> with zookeeper configuration as specified in spark documentation.
>
> Thanks & Regards
> Raghvendra
>
> On Wed, Jan 20, 2016 at 1:46 AM, Raghvendra Singh <
> raghvendra.ii...@gmail.com> wrote:
>
>> Here's the complete master log on reproducing the error
>> http://pastebin.com/2YJpyBiF
>>
>> Regards
>> Raghvendra
>>
>> On Wed, Jan 20, 2016 at 12:38 AM, Raghvendra Singh <
>> raghvendra.ii...@gmail.com> wrote:
>>
>>> Ok I Will try to reproduce the problem. Also I don't think this is an
>>> uncommon problem I am searching for this problem on Google for many days
>>> and found lots of questions but no answers.
>>>
>>> Do you know what kinds of settings spark and zookeeper allow for
>>> handling time outs during leader election etc. When one is down.
>>>
>>> Regards
>>> Raghvendra
>>> On 20-Jan-2016 12:28 am, "Ted Yu"  wrote:
>>>
 Perhaps I don't have enough information to make further progress.

 On Tue, Jan 19, 2016 at 10:55 AM, Raghvendra Singh <
 raghvendra.ii...@gmail.com> wrote:

> I currently do not have access to those logs but there were only about
> five lines before this error. They were the same which are present usually
> when everything works fine.
>
> Can you still help?
>
> Regards
> Raghvendra
> On 18-Jan-2016 8:50 pm, "Ted Yu"  wrote:
>
>> Can you pastebin master log before the error showed up ?
>>
>> The initial message was posted for Spark 1.2.0
>> Which release of Spark / zookeeper do you use ?
>>
>> Thanks
>>
>> On Mon, Jan 18, 2016 at 6:47 AM, doctorx 
>> wrote:
>>
>>> Hi,
>>> I am facing the same issue, with the given error
>>>
>>> ERROR Master:75 - Leadership has been revoked -- master shutting
>>> down.
>>>
>>> Can anybody help. Any clue will be useful. Should i change something
>>> in
>>> spark cluster or zookeeper. Is there any setting in spark which can
>>> help me?
>>>
>>> Thanks & Regards
>>> Raghvendra
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>

>>
>


How to use scala.math.Ordering in java

2016-01-20 Thread ddav
Hi,

I am writing my Spark application in java and I need to use a
RangePartitioner. 

JavaPairRDD<String, ProgramDataRef> progRef1 =
    sc.textFile(programReferenceDataFile, 12).filter(
        (String s) -> !s.startsWith("#")).mapToPair(
        (String s) -> {
            ProgramDataRef ref = new ProgramDataRef(s);
            return new Tuple2<String, ProgramDataRef>(ref.genre, ref);
        }
    );

RangePartitioner<String, ProgramDataRef> rangePart = new
    RangePartitioner<String, ProgramDataRef>(12, progRef1.rdd(), true, ?,
        progRef1.kClassTag());

I can't determine from the documentation how to create the correct object for
parameter 4, which is "scala.math.Ordering evidence$1". From the
scala.math.Ordering code I see there are many implicit objects and one
handles Strings. How can I access them from Java?

Thanks,
Dave.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-scala-math-Ordering-in-java-tp26019.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Redundant common columns of nature full outer join

2016-01-20 Thread Michael Armbrust
If you use the join that takes USING columns it should automatically
coalesce (take the non null value from) the left/right columns:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L405
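
Concretely (left, right and the shared column name are illustrative), the
USING-columns variant in Spark 1.6+ looks like the sketch below; how the shared
column is coalesced for full outer joins depends on the Spark version, since the
link above points at master:

// "key" appears only once in the output instead of once per input table.
val joined = left.join(right, Seq("key"), "outer")  // "outer" here means full outer join
joined.printSchema()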

On Tue, Jan 19, 2016 at 10:51 PM, Zhong Wang 
wrote:

> Hi all,
>
> I am joining two tables with common columns using full outer join.
> However, the current Dataframe API doesn't support nature joins, so the
> output contains redundant common columns from both of the tables.
>
> Is there any way to remove these redundant columns for a "nature" full
> outer join? For a left outer join or right outer join, I can select just
> the common columns from the left table or the right table. However, for a
> full outer join, it seems it is quite difficult to do that, because there
> are null values in both of the left and right common columns.
>
>
> Thanks,
> Zhong
>


Re: Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-20 Thread 刘虓
Hi,
I think you can view the Spark job UI to find out whether the partitioning
works or not. Pay attention to the storage page for the partition sizes, and to
which stage / task fails.

2016-01-20 16:25 GMT+08:00 fightf...@163.com :

> OK. I am trying to use the jdbc read datasource with predicate like the
> following :
>sqlContext.read.jdbc(url, table, Array("foo = 1", "foo = 3"), props)
> I can see that the task goes to 62 partitions. But I still get exception
> and the parquet
> file did not write successfully. Do I need to increase the partitions? Or
> is there any other
> alternatives I can choose to tune this ?
>
> Best,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* fightf...@163.com
> *Date:* 2016-01-20 15:06
> *To:* 刘虓 
> *CC:* user 
> *Subject:* Re: Re: spark dataframe jdbc read/write using dbcp connection
> pool
> Hi,
> Thanks a lot for your suggestion. I then tried the following code :
> val prop = new java.util.Properties
> prop.setProperty("user","test")
> prop.setProperty("password", "test")
> prop.setProperty("partitionColumn", "added_year")
> prop.setProperty("lowerBound", "1985")
> prop.setProperty("upperBound","2015")
> prop.setProperty("numPartitions", "200")
>
> val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
> val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
> val jdbcDF1 = sqlContext.read.jdbc(url1,"video3",prop)
> val jdbcDF2 = sqlContext.read.jdbc(url2,"video3",prop)
>
> val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
> jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf4
> ")
>
> The added_year column in mysql table contains range of (1985-2015), and I
> pass the numPartitions property
> to get the partition purpose. Is this what you recommend ? Can you advice
> a little more implementation on this ?
>
> Best,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* 刘虓 
> *Date:* 2016-01-20 11:26
> *To:* fightf...@163.com
> *CC:* user 
> *Subject:* Re: spark dataframe jdbc read/write using dbcp connection pool
> Hi,
> I suggest you partition the JDBC reading on a indexed column of the mysql
> table
>
> 2016-01-20 10:11 GMT+08:00 fightf...@163.com :
>
>> Hi ,
>> I want to load really large volumn datasets from mysql using spark
>> dataframe api. And then save as
>> parquet file or orc file to facilitate that with hive / Impala. The
>> datasets size is about 1 billion records and
>> when I am using the following naive code to run that , Error occurs and
>> executor lost failure.
>>
>> val prop = new java.util.Properties
>> prop.setProperty("user","test")
>> prop.setProperty("password", "test")
>>
>> val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
>> val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
>> val jdbcDF1 = sqlContext.read.jdbc(url1,"video",prop)
>> val jdbcDF2 = sqlContext.read.jdbc(url2,"video",prop)
>>
>> val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
>> jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf
>> ")
>>
>> I can see from the executor log and the message is like the following. I
>> can see from the log that the wait_timeout threshold reached
>> and there is no retry mechanism in the code process. So I am asking you
>> experts to help on tuning this. Or should I try to use a jdbc
>> connection pool to increase parallelism ?
>>
>>
>> 16/01/19 17:04:28 ERROR executor.Executor: Exception in task 0.0 in stage 
>> 0.0 (TID 0)
>>
>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
>> failure
>>
>>
>> The last packet successfully received from the server was 377,769 
>> milliseconds ago.  The last packet sent successfully to the server was 
>> 377,790 milliseconds ago.
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>
>> Caused by:
>> java.io.EOFException: Can not read response from server. Expected to read 4 
>> bytes, read 1 bytes before connection was unexpectedly lost.
>> at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2914)
>> at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1996)
>> ... 22 more
>>
>> 16/01/19 17:10:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
>> task 4
>> 16/01/19 17:10:47 INFO jdbc.JDBCRDD: closed connection
>>
>> 16/01/19 17:10:47 ERROR executor.Executor: Exception in task 1.1 in stage 
>> 0.0 (TID 2)
>>
>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
>> failure
>>
>>
>>
>> --
>> fightf...@163.com
>>
>
>


Re: Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-20 Thread fightf...@163.com
OK. From the Spark job UI I can see that more partitions are actually used when
I use predicates. But each task then failed
with the same error message:

   com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
failure

The last packet successfully received from the server was 377,769 milliseconds 
ago.  The last packet sent successfully to the server was 377,790 milliseconds 
ago.

Do I need to increase the number of partitions? Or shall I write a parquet file
for each partition in an iterative way?

Thanks a lot for your advice.

Best,
Sun.



fightf...@163.com
 
From: 刘虓
Date: 2016-01-20 18:31
To: fightf...@163.com
CC: user
Subject: Re: Re: spark dataframe jdbc read/write using dbcp connection pool
Hi,
I think you can view the spark job ui to find out whether the partition works 
or not,pay attention to the storage page to the partition size and which stage 
/ task fails

2016-01-20 16:25 GMT+08:00 fightf...@163.com :
OK. I am trying to use the jdbc read datasource with predicate like the 
following : 
   sqlContext.read.jdbc(url, table, Array("foo = 1", "foo = 3"), props)
I can see that the task goes to 62 partitions. But I still get exception and 
the parquet 
file did not write successfully. Do I need to increase the partitions? Or is 
there any other 
alternatives I can choose to tune this ? 

Best,
Sun.



fightf...@163.com
 
From: fightf...@163.com
Date: 2016-01-20 15:06
To: 刘虓
CC: user
Subject: Re: Re: spark dataframe jdbc read/write using dbcp connection pool
Hi,
Thanks a lot for your suggestion. I then tried the following code :
val prop = new java.util.Properties
prop.setProperty("user","test")
prop.setProperty("password", "test")
prop.setProperty("partitionColumn", "added_year")
prop.setProperty("lowerBound", "1985")
prop.setProperty("upperBound","2015")
prop.setProperty("numPartitions", "200")

val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
val jdbcDF1 = sqlContext.read.jdbc(url1,"video3",prop)
val jdbcDF2 = sqlContext.read.jdbc(url2,"video3",prop)

val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf4")

The added_year column in mysql table contains range of (1985-2015), and I pass 
the numPartitions property 
to get the partition purpose. Is this what you recommend ? Can you advice a 
little more implementation on this ? 

Best,
Sun.



fightf...@163.com
 
From: 刘虓
Date: 2016-01-20 11:26
To: fightf...@163.com
CC: user
Subject: Re: spark dataframe jdbc read/write using dbcp connection pool
Hi,
I suggest you partition the JDBC reading on a indexed column of the mysql table

2016-01-20 10:11 GMT+08:00 fightf...@163.com :
Hi , 
I want to load really large volumn datasets from mysql using spark dataframe 
api. And then save as 
parquet file or orc file to facilitate that with hive / Impala. The datasets 
size is about 1 billion records and 
when I am using the following naive code to run that , Error occurs and 
executor lost failure.

val prop = new java.util.Properties
prop.setProperty("user","test")
prop.setProperty("password", "test")

val url1 = "jdbc:mysql://172.16.54.136:3306/db1"
val url2 = "jdbc:mysql://172.16.54.138:3306/db1"
val jdbcDF1 = sqlContext.read.jdbc(url1,"video",prop)
val jdbcDF2 = sqlContext.read.jdbc(url2,"video",prop)

val jdbcDF3 = jdbcDF1.unionAll(jdbcDF2)
jdbcDF3.write.format("parquet").save("hdfs://172.16.54.138:8020/perf")

I can see from the executor log and the message is like the following. I can 
see from the log that the wait_timeout threshold reached 
and there is no retry mechanism in the code process. So I am asking you experts 
to help on tuning this. Or should I try to use a jdbc
connection pool to increase parallelism ? 

   16/01/19 17:04:28 ERROR executor.Executor: Exception in task 0.0 in stage 
0.0 (TID 0)
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
failure

The last packet successfully received from the server was 377,769 milliseconds 
ago.  The last packet sent successfully to the server was 377,790 milliseconds 
ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Caused by: java.io.EOFException: Can not read response from server. Expected to 
read 4 bytes, read 1 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2914)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1996)
... 22 more
16/01/19 17:10:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
4
16/01/19 17:10:47 INFO jdbc.JDBCRDD: closed connection
16/01/19 17:10:47 ERROR executor.Executor: Exception in task 1.1 in stage 0.0 
(TID 2)
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link 
failure





fightf...@163.com




How to query data in tachyon with spark-sql

2016-01-20 Thread Sea
Hi all, I want to mount some Hive tables in Tachyon, but I don't know how to
query the data in Tachyon with spark-sql. Does anyone know how?

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-20 Thread Raghvendra Singh
Thanks Paul,
So your reply prevented me from looking in the wrong direction, but I am
back to my original problem with ZooKeeper:
"Leadership has been revoked -- master shutting down"

Can anyone provide some feedback or add to this?
Regards
Raghvendra
On 20-Jan-2016 2:31 pm, "Paul Leclercq"  wrote:

> Hi Raghvendra and Spark users,
>
> I also have trouble activating my stand by master when my first master is
> shutdown (via a ./sbin/stop-master.sh or via a instance shut down) and
> just want to share with you my thoughts.
>
> To answer your question Raghvendra, in *spark-env.sh*, if 2 IPs are set
> for SPARK_MASTER_IP(SPARK_MASTER_IP='W.X.Y.Z,A.B.C.D'), the standalone
> cluster cannot be launched.
>
> So I only use only one IP there, as the Spark context can know other
> masters with a other way, as written in the Standalone Zookeeper HA
> 
> doc, "you might start your SparkContext pointing to
> spark://host1:port1,host2:port2"
>
> In my opinion, we should not have to set a SPARK_MASTER_IP as this is
> stored in ZooKeeper :
>
> you can launch multiple Masters in your cluster connected to the same
>> ZooKeeper instance. One will be elected “leader” and the others will remain
>> in standby mode.
>
> When starting up, an application or Worker needs to be able to find and
>> register with the current lead Master. Once it successfully registers,
>> though, it is “in the system” (i.e., stored in ZooKeeper).
>
>  -
> http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
>
> As I understand it, after a ./sbin/stop-master.sh on both master, a
> master will be elected, and the other will be stand by.
> To launch the workers, we can use ./sbin/start-slave.sh
> spark://MASTER_ELECTED_IP:7077
> I don't think if we can use the ./sbin/start-all.sh that use the salve
> file to launch workers and masters as we cannot set 2 master IPs inside
> spark-env.sh
>
> My SPARK_DAEMON_JAVA_OPTS content :
>
> SPARK_DAEMON_JAVA_OPTS='-Dspark.deploy.recoveryMode="ZOOKEEPER"
>> -Dspark.deploy.zookeeper.url="ZOOKEEPER_IP:2181"
>> -Dspark.deploy.zookeeper.dir="/spark"'
>
>
> A good thing to check if everything went OK is the folder /spark on the
> ZooKeeper server. I could not find it on my server.
>
> Thanks for reading,
>
> Paul
>
>
> 2016-01-19 22:12 GMT+01:00 Raghvendra Singh :
>
>> Hi, there is one question. In spark-env.sh should i specify all masters
>> for parameter SPARK_MASTER_IP. I've set SPARK_DAEMON_JAVA_OPTS already
>> with zookeeper configuration as specified in spark documentation.
>>
>> Thanks & Regards
>> Raghvendra
>>
>> On Wed, Jan 20, 2016 at 1:46 AM, Raghvendra Singh <
>> raghvendra.ii...@gmail.com> wrote:
>>
>>> Here's the complete master log on reproducing the error
>>> http://pastebin.com/2YJpyBiF
>>>
>>> Regards
>>> Raghvendra
>>>
>>> On Wed, Jan 20, 2016 at 12:38 AM, Raghvendra Singh <
>>> raghvendra.ii...@gmail.com> wrote:
>>>
 Ok I Will try to reproduce the problem. Also I don't think this is an
 uncommon problem I am searching for this problem on Google for many days
 and found lots of questions but no answers.

 Do you know what kinds of settings spark and zookeeper allow for
 handling time outs during leader election etc. When one is down.

 Regards
 Raghvendra
 On 20-Jan-2016 12:28 am, "Ted Yu"  wrote:

> Perhaps I don't have enough information to make further progress.
>
> On Tue, Jan 19, 2016 at 10:55 AM, Raghvendra Singh <
> raghvendra.ii...@gmail.com> wrote:
>
>> I currently do not have access to those logs but there were only
>> about five lines before this error. They were the same which are present
>> usually when everything works fine.
>>
>> Can you still help?
>>
>> Regards
>> Raghvendra
>> On 18-Jan-2016 8:50 pm, "Ted Yu"  wrote:
>>
>>> Can you pastebin master log before the error showed up ?
>>>
>>> The initial message was posted for Spark 1.2.0
>>> Which release of Spark / zookeeper do you use ?
>>>
>>> Thanks
>>>
>>> On Mon, Jan 18, 2016 at 6:47 AM, doctorx >> > wrote:
>>>
 Hi,
 I am facing the same issue, with the following error:

 ERROR Master:75 - Leadership has been revoked -- master shutting
 down.

 Can anybody help? Any clue will be useful. Should I change something in
 the Spark cluster or in ZooKeeper? Is there any setting in Spark that
 can help me?

 Thanks & Regards
 Raghvendra



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
 Sent from the Apache Spark 
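
For reference, a minimal Java sketch of the multi-master connection Paul
quotes from the standalone HA documentation. The host names host1/host2 and
port 7077 are placeholders, not values taken from this thread:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StandaloneHaSketch {
    public static void main(String[] args) {
        // List every master in the URL; the driver registers with the current
        // leader and fails over if leadership moves (placeholder host names).
        SparkConf conf = new SparkConf()
                .setAppName("standalone-ha-sketch")
                .setMaster("spark://host1:7077,host2:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println(sc.parallelize(Arrays.asList(1, 2, 3)).count());
        sc.stop();
    }
}

Only one address goes into SPARK_MASTER_IP in spark-env.sh, as discussed
above; the driver reaches the other master through the comma-separated URL
and the state kept in ZooKeeper. A quick sanity check of that state is to
open bin/zkCli.sh -server ZOOKEEPER_IP:2181 on the ZooKeeper host and run
ls /spark.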

Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-20 Thread Siddharth Ubale
Hi,

I am running a Spark job on the YARN cluster.
The job is a Spark Streaming application that reads JSON from a Kafka topic,
inserts the JSON values into HBase tables via Phoenix, and then sends certain
messages to a WebSocket if the JSON satisfies certain criteria.

My cluster is a 3-node cluster with 24 GB RAM and 24 cores in total.

Now:
1. When I submit the job with 10 GB of memory, the application fails, saying
memory is insufficient to run the job.
2. The job is submitted with 6 GB of RAM. However, it does not always run
successfully. Common issues faced:
a. Container exited with a non-zero exit code 1, and after multiple
such warnings the job is finished.
b. The failed job reports that it was unable to find a file in
HDFS named something like _hadoop_conf_xx.zip.

Can someone please let me know why I am seeing the above two issues?

Thanks,
Siddharth Ubale,
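
A hedged sketch of one way such a job could be sized at submission time on a
3-node, 24 GB cluster. The class and jar names are placeholders, and the
memory and core figures are illustrative assumptions, not a verified fix for
the errors above:

# Placeholder class/jar names; memory and core figures are illustrative only.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --num-executors 3 \
  --executor-memory 4g \
  --executor-cores 4 \
  --class com.example.StreamingJob \
  streaming-job.jar

Each executor request also carries an off-heap overhead
(spark.yarn.executor.memoryOverhead), so executor memory plus overhead has to
stay below what YARN will allocate per container
(yarn.scheduler.maximum-allocation-mb); requests above that limit are
rejected by YARN.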



Re: Appending filename information to RDD initialized by sc.textFile

2016-01-20 Thread Femi Anthony
Thanks, I'll take a look.

On Wed, Jan 20, 2016 at 1:38 AM, Akhil Das 
wrote:

> You can use sc.newAPIHadoopFile and pass your own InputFormat and
> RecordReader that read the compressed .gz files for your use case. For
> a start, you can look at:
>
> - wholeTextFile implementation
> 
> - WholeTextFileInputFormat
> 
> - WholeTextFileRecordReader
> 
>
>
>
>
>
> Thanks
> Best Regards
>
> On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony  wrote:
>
>>
>>
>> I have a set of log files I would like to read into an RDD. These
>> files are all compressed .gz, and the filenames are date-stamped. The
>> source of these files is the page view statistics data for Wikipedia:
>>
>> http://dumps.wikimedia.org/other/pagecounts-raw/
>>
>> The file names look like this:
>>
>> pagecounts-20090501-00.gz
>> pagecounts-20090501-01.gz
>> pagecounts-20090501-02.gz
>>
>> What I would like to do is read in all such files in a directory and
>> prepend the date from the filename (e.g. 20090501) to each row of the
>> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead
>> of *sc.textFile(..)*, which creates a PairRDD with the key being the
>> file name with a path, but *sc.wholeTextFiles()* doesn't handle
>> compressed .gz files.
>>
>> Any suggestions would be welcome.
>>
>> --
>> http://www.femibyte.com/twiki5/bin/view/Tech/
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
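
Not the custom-InputFormat route suggested above, but a simpler sketch of the
same goal: since textFile() decompresses .gz input transparently, each file
can be read on its own and its rows tagged with the date parsed from the file
name. The directory path and hard-coded file list here are assumptions for
illustration:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PagecountsWithDate {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("pagecounts-with-date"));

        // Hypothetical location; file names follow the pattern from the post.
        String dir = "hdfs:///data/pagecounts";
        List<String> files = Arrays.asList(
                "pagecounts-20090501-00.gz",
                "pagecounts-20090501-01.gz");

        JavaPairRDD<String, String> tagged = null;
        for (String name : files) {
            final String date = name.split("-")[1];                // e.g. "20090501"
            JavaRDD<String> lines = sc.textFile(dir + "/" + name); // .gz handled transparently
            JavaPairRDD<String, String> withDate =
                    lines.mapToPair(line -> new Tuple2<>(date, line));
            tagged = (tagged == null) ? withDate : tagged.union(withDate);
        }

        tagged.take(5).forEach(System.out::println);
        sc.stop();
    }
}

The file list could equally come from FileSystem.listStatus() on the
directory; the hard-coded names above simply mirror the examples in the
question.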


Scala MatchError in Spark SQL

2016-01-20 Thread raghukiran
Hi,

I created a custom UserDefinedType in Java as follows:

UserDefinedType<JavaPoint> SQLPoint = new UserDefinedType<JavaPoint>() {
    // overriding the serialize, deserialize, sqlType, userClass functions here
};

When creating a DataFrame, I am following the manual mapping approach. I have
a constructor for JavaPoint, JavaPoint(double x, double y), and a
CustomerRecord class as follows:

public class CustomerRecord {
    private int id;
    private String name;
    private Object location;

    // setters and getters follow here
}

Following the example in the Spark source, I create an RDD as follows:

sc.textFile(inputFileName).map(new Function<String, CustomerRecord>() {
    public CustomerRecord call(String line) {
        CustomerRecord rec = new CustomerRecord();
        rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
        return rec;
    }
});

This results in a MatchError. The stack trace is as follows:

scala.MatchError: [B@45aa3dd5 (of class [B)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at
org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
at
org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
at
org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-MatchError-in-Spark-SQL-tp26021.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
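
Regarding the MatchError above: a hedged sketch of the annotation-driven
pattern the UDT API supports, shown under several assumptions. The UDT is
assumed to be a named class, SQLPointUDT, extending UserDefinedType<JavaPoint>
with the overrides described in the post (the post uses an anonymous
instance); the bean field is typed JavaPoint rather than Object; and
serialize() is left to Spark instead of being called by hand. Whether this
resolves this exact MatchError is not verified here.

import java.io.Serializable;

import org.apache.spark.sql.types.SQLUserDefinedType;

// Assumed companion class: SQLPointUDT extends UserDefinedType<JavaPoint> and
// carries the serialize/deserialize/sqlType/userClass overrides from the post.
@SQLUserDefinedType(udt = SQLPointUDT.class)
public class JavaPoint implements Serializable {
    private final double x;
    private final double y;

    public JavaPoint(double x, double y) {
        this.x = x;
        this.y = y;
    }

    public double getX() { return x; }
    public double getY() { return y; }
}

CustomerRecord would then declare the field as private JavaPoint location;
and the map function would call rec.setLocation(new JavaPoint(x, y)), leaving
the conversion through the UDT to Spark when the DataFrame is created.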