Bug in PolynomialExpansion?

2016-05-29 Thread Jeff Zhang
I use PolynomialExpansion to expand a vector to degree 2. I am confused
about the following result. As I understand it, the degree-2 vector should
contain four 1's; I am not sure where the five 1's come from. I think it is
supposed to be (x1,x2,x3) * (x1,x2,x3) = (x1*x1, x1*x2, x1*x3, x2*x1, x2*x2,
x2*x3, x3*x1, x3*x2, x3*x3)

(3,[0,2],[1.0,1.0])  -->
(9,[0,1,5,6,8],[1.0,1.0,1.0,1.0,1.0])
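
For reference, here is a minimal sketch (assuming the Spark 1.6 spark.ml API and
an existing SparkContext sc) that reproduces the output above:

import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

// one-row DataFrame holding the sparse vector from the example:
// (x1, x2, x3) = (1.0, 0.0, 1.0)
val sqlContext = new SQLContext(sc)
val df = sqlContext.createDataFrame(Seq(
  Tuple1(Vectors.sparse(3, Array(0, 2), Array(1.0, 1.0)))
)).toDF("features")

val polyExpansion = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(2)

// prints the 9-dimensional sparse vector shown above, with 1.0 at indices 0, 1, 5, 6 and 8
polyExpansion.transform(df).select("polyFeatures").show(false)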


-- 
Best Regards

Jeff Zhang


Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
These packages are usable only from Scala.
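
For reference, a minimal GraphX sketch in Scala (assuming an existing SparkContext sc):

import org.apache.spark.graphx.{Edge, Graph}

// toy property graph: three vertices and two directed edges
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println(graph.numEdges)                               // 2
graph.pageRank(0.0001).vertices.collect().foreach(println)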

On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) <
abhishekkuma...@deloitte.com> wrote:

> Hey,
>
> ·   I see some graphx packages listed here:
>
> http://spark.apache.org/docs/latest/api/java/index.html
>
> ·   org.apache.spark.graphx
>
> ·   org.apache.spark.graphx.impl
>
> ·   org.apache.spark.graphx.lib
>
> ·   org.apache.spark.graphx.util
>
> Aren’t they meant to be used with JAVA?
>
> Thanks
>
>
>
> *From:* Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
> *Sent:* Friday, May 27, 2016 4:52 PM
> *To:* Kumar, Abhishek (US - Bengaluru) ;
> user@spark.apache.org
> *Subject:* RE: GraphX Java API
>
>
>
> GraphX APIs are available only in Scala. If you need to use GraphX you
> need to switch to Scala.
>
>
>
> *From:* Kumar, Abhishek (US - Bengaluru) [
> mailto:abhishekkuma...@deloitte.com ]
> *Sent:* 27 May 2016 19:59
> *To:* user@spark.apache.org
> *Subject:* GraphX Java API
>
>
>
> Hi,
>
>
>
> We are trying to consume the Java API for GraphX, but there is no
> documentation available online on the usage or examples. It would be great
> if we could get some examples in Java.
>
>
>
> Thanks and regards,
>
>
>
> *Abhishek Kumar*
>
>
>
>
>
>
>



-- 
---
Takeshi Yamamuro


RE: GraphX Java API

2016-05-29 Thread Kumar, Abhishek (US - Bengaluru)
Hey,
·   I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
·   org.apache.spark.graphx
·   org.apache.spark.graphx.impl
·   org.apache.spark.graphx.lib
·   org.apache.spark.graphx.util
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) ; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX you need to
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar















Preview release of Spark 2.0

2016-05-29 Thread charles li
Here is the link: http://spark.apache.org/news/spark-2.0.0-preview.html

congrats, haha, looking forward to 2.0.1, awesome project.


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: Re: G1 GC takes too much time

2016-05-29 Thread Ted Yu
Please consider reading G1GC tuning guide(s).
Here is an example:

http://product.hubspot.com/blog/g1gc-tuning-your-hbase-cluster
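
For illustration only (the flag values below are placeholders echoing this thread,
not a recommendation), executor GC options are usually passed through
spark.executor.extraJavaOptions, e.g.:

import org.apache.spark.SparkConf

// sketch: G1 flags for the executor JVMs plus GC logging to see where the time goes
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps")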

On Sun, May 29, 2016 at 7:17 PM, condor join 
wrote:

> The following are the parameters:
> -XX:+UseG1GC
> -XX:+UnlockDiagnosticVMOptions
> -XX:+G1SummarizeConcMark
> -XX:InitiatingHeapOccupancyPercent=35
> spark.executor.memory=4G
>
> --
> *From:* Ted Yu
> *Sent:* May 30, 2016 9:47:05
> *To:* condor join
> *Cc:* user@spark.apache.org
> *Subject:* Re: G1 GC takes too much time
>
> bq. It happens during the Reduce majority.
>
> Did the above refer to reduce operation ?
>
> Can you share your G1GC parameters (and heap size for workers) ?
>
> Thanks
>
> On Sun, May 29, 2016 at 6:15 PM, condor join 
> wrote:
>
>> Hi,
>> my Spark application failed because it spent too much time in GC. Looking
>> at the logs I found the following:
>> 1. Young GC takes too much time, and no Full GC is seen at those points;
>> 2. The time is mostly spent in object copy;
>> 3. It happens more easily when there are not enough resources;
>> 4. It happens during the Reduce majority.
>>
>> Has anyone met the same problem?
>> thanks
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>


Re: G1 GC takes too much time

2016-05-29 Thread Sea
Yes, it seems that CMS is better. I have tried G1 as the Databricks blog
recommended, but it's too slow.




------------------ Original Message ------------------
From: "condor join"
Sent: May 30, 2016 (Mon) 10:17
To: "Ted Yu"
Cc: "user@spark.apache.org"
Subject: Re: G1 GC takes too much time



  The following are the parameters:
 -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark
 -XX:InitiatingHeapOccupancyPercent=35
 
 spark.executor.memory=4G
 
 
 
 
 From: Ted Yu
 Sent: May 30, 2016 9:47:05
 To: condor join
 Cc: user@spark.apache.org
 Subject: Re: G1 GC takes too much time
 
  bq. It happens during the Reduce majority. 
 
 Did the above refer to reduce operation ?
 
 
 Can you share your G1GC parameters (and heap size for workers) ?
 
 
 Thanks
 
 
 On Sun, May 29, 2016 at 6:15 PM, condor join   wrote:
Hi, my Spark application failed because it spent too much time in GC.
Looking at the logs I found the following:
 1. Young GC takes too much time, and no Full GC is seen at those points;
 2. The time is mostly spent in object copy;
 3. It happens more easily when there are not enough resources;
 4. It happens during the Reduce majority.
 
 
 Has anyone met the same problem?
 thanks
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

G1 GC takes too much time

2016-05-29 Thread condor join
Hi,
my Spark application failed because it spent too much time in GC. Looking at the
logs I found the following:
1. Young GC takes too much time, and no Full GC is seen at those points;
2. The time is mostly spent in object copy;
3. It happens more easily when there are not enough resources;
4. It happens during the Reduce majority.

Has anyone met the same problem?
thanks


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
Sure, let me try that. But from the looks of it, it seems
kryo.util.MapReferenceResolver.getReadObject is trying to access an incorrect
index (100).

On Sun, May 29, 2016 at 5:06 PM, Ted Yu  wrote:

> Can you register Put with Kryo ?
>
> Thanks
>
> On May 29, 2016, at 4:58 PM, Nirav Patel  wrote:
>
> I pasted a code snippet for that method.
>
> here's full def:
>
>   def writeRddToHBase2(hbaseRdd: RDD[(ImmutableBytesWritable, Put)],
> tableName: String) {
>
>
> hbaseRdd.values.foreachPartition{ itr =>
>
> val hConf = HBaseConfiguration.create()
>
> hConf.setInt("hbase.client.write.buffer", 16097152)
>
> val table = new HTable(hConf, tableName)
>
> //table.setWriteBufferSize(8388608)
>
> itr.grouped(100).foreach(table.put(_))   // << Exception
> happens at this point
>
> table.close()
>
> }
>
>   }
>
>
> I am using hbase 0.98.12 mapr distribution.
>
>
> Thanks
>
> Nirav
>
> On Sun, May 29, 2016 at 4:46 PM, Ted Yu  wrote:
>
>> bq.  at com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$
>> anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
>>
>> Can you reveal related code from HbaseUtils.scala ?
>>
>> Which hbase version are you using ?
>>
>> Thanks
>>
>> On Sun, May 29, 2016 at 4:26 PM, Nirav Patel 
>> wrote:
>>
>>> Hi,
>>>
>>> I am getting the following Kryo deserialization error when trying to
>>> bulk load a cached RDD into HBase. It works if I don't cache the RDD. I cache
>>> it with MEMORY_ONLY_SER.
>>>
>>> here's the code snippet:
>>>
>>>
>>> hbaseRdd.values.foreachPartition{ itr =>
>>> val hConf = HBaseConfiguration.create()
>>> hConf.setInt("hbase.client.write.buffer", 16097152)
>>> val table = new HTable(hConf, tableName)
>>> itr.grouped(100).foreach(table.put(_))
>>> table.close()
>>> }
>>> hbaseRdd is of type RDD[(ImmutableBytesWritable, Put)]
>>>
>>>
>>> Here is the exception I am getting. I read on the Kryo JIRA that this may be an
>>> issue with incorrect use of the serialization library. So could this be an issue
>>> with the twitter-chill library or Spark core itself?
>>>
>>> Job aborted due to stage failure: Task 16 in stage 9.0 failed 10 times,
>>> most recent failure: Lost task 16.9 in stage 9.0 (TID 28614,
>>> hdn10.mycorptcorporation.local): com.esotericsoftware.kryo.KryoException:
>>> java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
>>> Serialization trace:
>>> familyMap (org.apache.hadoop.hbase.client.Put)
>>> at
>>> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>>> at
>>> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>>> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
>>> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
>>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>>> at
>>> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
>>> at
>>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
>>> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>>> at
>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
>>> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at
>>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
>>> at
>>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:75)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
>>> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>>> at java.util.ArrayList.get(ArrayList.java:411)
>>> at
>>> 

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Ted Yu
Can you register Put with Kryo ?
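
Something along these lines should do it (a sketch, assuming Kryo is already the
configured serializer):

import org.apache.spark.SparkConf
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

// register the HBase classes carried by the RDD so Kryo serializes them explicitly
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Put], classOf[ImmutableBytesWritable]))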

Thanks

> On May 29, 2016, at 4:58 PM, Nirav Patel  wrote:
> 
> I pasted a code snippet for that method.
> 
> here's full def:
> 
>   def writeRddToHBase2(hbaseRdd: RDD[(ImmutableBytesWritable, Put)], 
> tableName: String) {
> 
> 
> 
> hbaseRdd.values.foreachPartition{ itr =>
> 
> val hConf = HBaseConfiguration.create()
> 
> hConf.setInt("hbase.client.write.buffer", 16097152)
> 
> val table = new HTable(hConf, tableName)
> 
> //table.setWriteBufferSize(8388608)
> 
> itr.grouped(100).foreach(table.put(_))   // << Exception happens at 
> this point
> 
> table.close()
> 
> }
> 
>   }
> 
> 
> 
> I am using hbase 0.98.12 mapr distribution.
> 
> 
> 
> Thanks
> 
> Nirav
> 
> 
>> On Sun, May 29, 2016 at 4:46 PM, Ted Yu  wrote:
>> bq.  at 
>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
>> 
>> Can you reveal related code from HbaseUtils.scala ?
>> 
>> Which hbase version are you using ?
>> 
>> Thanks
>> 
>>> On Sun, May 29, 2016 at 4:26 PM, Nirav Patel  wrote:
>>> Hi,
>>> 
>>> I am getting the following Kryo deserialization error when trying to bulk load a 
>>> cached RDD into HBase. It works if I don't cache the RDD. I cache it with 
>>> MEMORY_ONLY_SER.
>>> 
>>> here's the code snippet:
>>> 
>>> 
>>> hbaseRdd.values.foreachPartition{ itr =>
>>> val hConf = HBaseConfiguration.create()
>>> hConf.setInt("hbase.client.write.buffer", 16097152)
>>> val table = new HTable(hConf, tableName)
>>> itr.grouped(100).foreach(table.put(_))
>>> table.close()
>>> }
>>> hbaseRdd is of type RDD[(ImmutableBytesWritable, Put)]
>>> 
>>> 
>>> Here is the exception I am getting. I read on the Kryo JIRA that this may be an 
>>> issue with incorrect use of the serialization library. So could this be an issue 
>>> with the twitter-chill library or Spark core itself? 
>>> 
>>> Job aborted due to stage failure: Task 16 in stage 9.0 failed 10 times, 
>>> most recent failure: Lost task 16.9 in stage 9.0 (TID 28614, 
>>> hdn10.mycorptcorporation.local): com.esotericsoftware.kryo.KryoException: 
>>> java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
>>> Serialization trace:
>>> familyMap (org.apache.hadoop.hbase.client.Put)
>>> at 
>>> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>>> at 
>>> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>>> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
>>> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
>>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>>> at 
>>> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
>>> at 
>>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
>>> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>>> at 
>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
>>> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at 
>>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
>>> at 
>>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:75)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
>>> at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>> at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
>>> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>>> at java.util.ArrayList.get(ArrayList.java:411)
>>> at 
>>> 

Re: Accessing s3a files from Spark

2016-05-29 Thread Mayuresh Kunjir
On Sun, May 29, 2016 at 7:49 PM, Ted Yu  wrote:

> Have you seen this thread ?
>
>
> http://search-hadoop.com/m/q3RTthWU8o1MbFC2=Re+Forbidded+Error+Code+403
>
>
​
Thanks for the pointer. I have followed the thread, got no success though.

I am trying out the Spark branch suggested by Teng Qiu above, will update
soon.

​


> On Sun, May 29, 2016 at 2:55 PM, Mayuresh Kunjir 
> wrote:
>
>> I'm running into permission issues while accessing data in S3 bucket
>> stored using s3a file system from a local Spark cluster. Has anyone found
>> success with this?
>>
>> My setup is:
>> - Spark 1.6.1 compiled against Hadoop 2.7.2
>> - aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.2.jar in the classpath
>> - Spark's Hadoop configuration is as follows:
>>
>>
>> sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
>>
>> sc.hadoopConfiguration.set("fs.s3a.access.key", )
>>
>> sc.hadoopConfiguration.set("fs.s3a.secret.key", )
>>
>> (The secret key does not have any '/' characters which is reported to
>> cause some issue by others)
>>
>>
>> I have configured my S3 bucket to grant the necessary permissions. (
>> https://sparkour.urizone.net/recipes/configuring-s3/)
>>
>>
>> What works: Listing, reading from, and writing to s3a using hadoop
>> command. e.g. hadoop dfs -ls s3a:///
>>
>>
>> What doesn't work: Reading from s3a using Spark's textFile API. Each task
>> throws an exception which says *Forbidden Access(403)*.
>>
>>
>> Some online documents suggest to use IAM roles to grant permissions for
>> an AWS cluster. But I would like a solution for my local standalone cluster.
>>
>>
>> Any help would be appreciated.
>>
>>
>> Regards,
>>
>> ~Mayuresh
>>
>
>


Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
I pasted a code snippet for that method.

here's full def:

  def writeRddToHBase2(hbaseRdd: RDD[(ImmutableBytesWritable, Put)],
tableName: String) {


hbaseRdd.values.foreachPartition{ itr =>

val hConf = HBaseConfiguration.create()

hConf.setInt("hbase.client.write.buffer", 16097152)

val table = new HTable(hConf, tableName)

//table.setWriteBufferSize(8388608)

itr.grouped(100).foreach(table.put(_))   // << Exception happens
at this point

table.close()

}

  }


I am using hbase 0.98.12 mapr distribution.


Thanks

Nirav

On Sun, May 29, 2016 at 4:46 PM, Ted Yu  wrote:

> bq.  at com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$
> anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
>
> Can you reveal related code from HbaseUtils.scala ?
>
> Which hbase version are you using ?
>
> Thanks
>
> On Sun, May 29, 2016 at 4:26 PM, Nirav Patel 
> wrote:
>
>> Hi,
>>
>> I am getting the following Kryo deserialization error when trying to bulk load
>> a cached RDD into HBase. It works if I don't cache the RDD. I cache it
>> with MEMORY_ONLY_SER.
>>
>> here's the code snippet:
>>
>>
>> hbaseRdd.values.foreachPartition{ itr =>
>> val hConf = HBaseConfiguration.create()
>> hConf.setInt("hbase.client.write.buffer", 16097152)
>> val table = new HTable(hConf, tableName)
>> itr.grouped(100).foreach(table.put(_))
>> table.close()
>> }
>> hbaseRdd is of type RDD[(ImmutableBytesWritable, Put)]
>>
>>
>> Here is the exception I am getting. I read on the Kryo JIRA that this may be an
>> issue with incorrect use of the serialization library. So could this be an issue
>> with the twitter-chill library or Spark core itself?
>>
>> Job aborted due to stage failure: Task 16 in stage 9.0 failed 10 times,
>> most recent failure: Lost task 16.9 in stage 9.0 (TID 28614,
>> hdn10.mycorptcorporation.local): com.esotericsoftware.kryo.KryoException:
>> java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
>> Serialization trace:
>> familyMap (org.apache.hadoop.hbase.client.Put)
>> at
>> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>> at
>> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
>> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>> at
>> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
>> at
>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
>> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>> at
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
>> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at
>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
>> at
>> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:75)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
>> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>> at java.util.ArrayList.get(ArrayList.java:411)
>> at
>> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
>> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:773)
>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
>> at
>> com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
>> at
>> com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
>> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)

Re: Accessing s3a files from Spark

2016-05-29 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTthWU8o1MbFC2=Re+Forbidded+Error+Code+403

On Sun, May 29, 2016 at 2:55 PM, Mayuresh Kunjir 
wrote:

> I'm running into permission issues while accessing data in S3 bucket
> stored using s3a file system from a local Spark cluster. Has anyone found
> success with this?
>
> My setup is:
> - Spark 1.6.1 compiled against Hadoop 2.7.2
> - aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.2.jar in the classpath
> - Spark's Hadoop configuration is as follows:
>
>
> sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
>
> sc.hadoopConfiguration.set("fs.s3a.access.key", )
>
> sc.hadoopConfiguration.set("fs.s3a.secret.key", )
>
> (The secret key does not have any '/' characters which is reported to
> cause some issue by others)
>
>
> I have configured my S3 bucket to grant the necessary permissions. (
> https://sparkour.urizone.net/recipes/configuring-s3/)
>
>
> What works: Listing, reading from, and writing to s3a using hadoop
> command. e.g. hadoop dfs -ls s3a:///
>
>
> What doesn't work: Reading from s3a using Spark's textFile API. Each task
> throws an exception which says *Forbidden Access(403)*.
>
>
> Some online documents suggest to use IAM roles to grant permissions for an
> AWS cluster. But I would like a solution for my local standalone cluster.
>
>
> Any help would be appreciated.
>
>
> Regards,
>
> ~Mayuresh
>


Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Ted Yu
bq.  at com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$
anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)

Can you reveal related code from HbaseUtils.scala ?

Which hbase version are you using ?

Thanks

On Sun, May 29, 2016 at 4:26 PM, Nirav Patel  wrote:

> Hi,
>
> I am getting the following Kryo deserialization error when trying to bulk load
> a cached RDD into HBase. It works if I don't cache the RDD. I cache it
> with MEMORY_ONLY_SER.
>
> here's the code snippet:
>
>
> hbaseRdd.values.foreachPartition{ itr =>
> val hConf = HBaseConfiguration.create()
> hConf.setInt("hbase.client.write.buffer", 16097152)
> val table = new HTable(hConf, tableName)
> itr.grouped(100).foreach(table.put(_))
> table.close()
> }
> hbaseRdd is of type RDD[(ImmutableBytesWritable, Put)]
>
>
> Here is the exception I am getting. I read on the Kryo JIRA that this may be an
> issue with incorrect use of the serialization library. So could this be an issue
> with the twitter-chill library or Spark core itself?
>
> Job aborted due to stage failure: Task 16 in stage 9.0 failed 10 times,
> most recent failure: Lost task 16.9 in stage 9.0 (TID 28614,
> hdn10.mycorptcorporation.local): com.esotericsoftware.kryo.KryoException:
> java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
> Serialization trace:
> familyMap (org.apache.hadoop.hbase.client.Put)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
> at
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
> at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
> at
> com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:75)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> at java.util.ArrayList.get(ArrayList.java:411)
> at
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:773)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
> at
> com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
> at
> com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
> ... 26 more
>
>
>
> 


Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
Hi,

I am getting the following Kryo deserialization error when trying to bulk load
a cached RDD into HBase. It works if I don't cache the RDD. I cache it
with MEMORY_ONLY_SER.

here's the code snippet:


hbaseRdd.values.foreachPartition{ itr =>
val hConf = HBaseConfiguration.create()
hConf.setInt("hbase.client.write.buffer", 16097152)
val table = new HTable(hConf, tableName)
itr.grouped(100).foreach(table.put(_))
table.close()
}
hbaseRdd is of type RDD[(ImmutableBytesWritable, Put)]


Here is the exception I am getting. I read on the Kryo JIRA that this may be an
issue with incorrect use of the serialization library. So could this be an issue
with the twitter-chill library or Spark core itself?

Job aborted due to stage failure: Task 16 in stage 9.0 failed 10 times,
most recent failure: Lost task 16.9 in stage 9.0 (TID 28614,
hdn10.mycorptcorporation.local): com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
Serialization trace:
familyMap (org.apache.hadoop.hbase.client.Put)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
at
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
at
com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:75)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:773)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 26 more

-- 





Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Stephen Boesch
Thanks Bryan for that pointer: I will follow it. In the meantime the
OneVsRest classifier appears to satisfy the requirements.
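
For reference, a minimal OneVsRest sketch (assuming DataFrames named training and
test with the usual label/features columns; the names and parameter values here
are placeholders):

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// binary spark.ml LogisticRegression used as the base classifier
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)

val ovr = new OneVsRest().setClassifier(lr)
val model = ovr.fit(training)            // trains one binary model per class
val predictions = model.transform(test)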

2016-05-29 15:40 GMT-07:00 Bryan Cutler :

> This is currently being worked on, planned for 2.1 I believe
> https://issues.apache.org/jira/browse/SPARK-7159
> On May 28, 2016 9:31 PM, "Stephen Boesch"  wrote:
>
>> Thanks Phuong. But the point of my post is how to achieve this without using
>>  the deprecated mllib package. The mllib package already has
>>  multinomial regression built in.
>>
>> 2016-05-28 21:19 GMT-07:00 Phuong LE-HONG :
>>
>>> Dear Stephen,
>>>
>>> Yes, you're right, LogisticGradient is in the mllib package, not ml
>>> package. I just want to say that we can build a multinomial logistic
>>> regression model from the current version of Spark.
>>>
>>> Regards,
>>>
>>> Phuong
>>>
>>>
>>>
>>> On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch 
>>> wrote:
>>> > Hi Phuong,
>>> >The LogisticGradient exists in the mllib but not ml package. The
>>> > LogisticRegression chooses either the breeze LBFGS - if L2 only (not
>>> elastic
>>> > net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
>>> > otherwise: it does not appear to choose GD in either scenario.
>>> >
>>> > If I have misunderstood your response please do clarify.
>>> >
>>> > thanks stephenb
>>> >
>>> > 2016-05-28 20:55 GMT-07:00 Phuong LE-HONG :
>>> >>
>>> >> Dear Stephen,
>>> >>
>>> >> The Logistic Regression currently supports only binary regression.
>>> >> However, the LogisticGradient does support computing gradient and loss
>>> >> for a multinomial logistic regression. That is, you can train a
>>> >> multinomial logistic regression model with LogisticGradient and a
>>> >> class to solve optimization like LBFGS to get a weight vector of the
>>> >> size (numClasses-1)*numFeatures.
>>> >>
>>> >>
>>> >> Phuong
>>> >>
>>> >>
>>> >> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch 
>>> >> wrote:
>>> >> > Followup: just encountered the "OneVsRest" classifier in
>>> >> > ml.classification: I will look into using it with the binary
>>> >> > LogisticRegression as the provided classifier.
>>> >> >
>>> >> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch :
>>> >> >>
>>> >> >>
>>> >> >> Presently only the mllib version has the one-vs-all approach for
>>> >> >> multinomial support.  The ml version with ElasticNet support only
>>> >> >> allows
>>> >> >> binary regression.
>>> >> >>
>>> >> >> With feature parity of ml vs mllib having been stated as an
>>> objective
>>> >> >> for
>>> >> >> 2.0.0 -  is there a projected availability of the  multinomial
>>> >> >> regression in
>>> >> >> the ml package?
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> `
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Bryan Cutler
This is currently being worked on, planned for 2.1 I believe
https://issues.apache.org/jira/browse/SPARK-7159
On May 28, 2016 9:31 PM, "Stephen Boesch"  wrote:

> Thanks Phuong. But the point of my post is how to achieve this without using
>  the deprecated mllib package. The mllib package already has
>  multinomial regression built in.
>
> 2016-05-28 21:19 GMT-07:00 Phuong LE-HONG :
>
>> Dear Stephen,
>>
>> Yes, you're right, LogisticGradient is in the mllib package, not ml
>> package. I just want to say that we can build a multinomial logistic
>> regression model from the current version of Spark.
>>
>> Regards,
>>
>> Phuong
>>
>>
>>
>> On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch 
>> wrote:
>> > Hi Phuong,
>> >The LogisticGradient exists in the mllib but not ml package. The
>> > LogisticRegression chooses either the breeze LBFGS - if L2 only (not
>> elastic
>> > net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
>> > otherwise: it does not appear to choose GD in either scenario.
>> >
>> > If I have misunderstood your response please do clarify.
>> >
>> > thanks stephenb
>> >
>> > 2016-05-28 20:55 GMT-07:00 Phuong LE-HONG :
>> >>
>> >> Dear Stephen,
>> >>
>> >> The Logistic Regression currently supports only binary regression.
>> >> However, the LogisticGradient does support computing gradient and loss
>> >> for a multinomial logistic regression. That is, you can train a
>> >> multinomial logistic regression model with LogisticGradient and a
>> >> class to solve optimization like LBFGS to get a weight vector of the
>> >> size (numClasses-1)*numFeatures.
>> >>
>> >>
>> >> Phuong
>> >>
>> >>
>> >> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch 
>> >> wrote:
>> >> > Followup: just encountered the "OneVsRest" classifier in
>> >> > ml.classification: I will look into using it with the binary
>> >> > LogisticRegression as the provided classifier.
>> >> >
>> >> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch :
>> >> >>
>> >> >>
>> >> >> Presently only the mllib version has the one-vs-all approach for
>> >> >> multinomial support.  The ml version with ElasticNet support only
>> >> >> allows
>> >> >> binary regression.
>> >> >>
>> >> >> With feature parity of ml vs mllib having been stated as an
>> objective
>> >> >> for
>> >> >> 2.0.0 -  is there a projected availability of the  multinomial
>> >> >> regression in
>> >> >> the ml package?
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> `
>> >> >
>> >> >
>> >
>> >
>>
>
>


Accessing s3a files from Spark

2016-05-29 Thread Mayuresh Kunjir
I'm running into permission issues while accessing data in S3 bucket stored
using s3a file system from a local Spark cluster. Has anyone found success
with this?

My setup is:
- Spark 1.6.1 compiled against Hadoop 2.7.2
- aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.2.jar in the classpath
- Spark's Hadoop configuration is as follows:

sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")

sc.hadoopConfiguration.set("fs.s3a.access.key", )

sc.hadoopConfiguration.set("fs.s3a.secret.key", )

(The secret key does not have any '/' characters which is reported to cause
some issue by others)
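
For reference, the equivalent configuration can also be supplied up front through
SparkConf via the spark.hadoop.* passthrough (a sketch; the angle-bracket values
are placeholders, not real keys or bucket names):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3a-test")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
  .set("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")

val sc = new SparkContext(conf)
// same read that currently fails with 403 via textFile
println(sc.textFile("s3a://<bucket>/<path>").count())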


I have configured my S3 bucket to grant the necessary permissions. (
https://sparkour.urizone.net/recipes/configuring-s3/)


What works: Listing, reading from, and writing to s3a using hadoop command.
e.g. hadoop dfs -ls s3a:///


What doesn't work: Reading from s3a using Spark's textFile API. Each task
throws an exception which says *Forbidden Access(403)*.


Some online documents suggest to use IAM roles to grant permissions for an
AWS cluster. But I would like a solution for my local standalone cluster.


Any help would be appreciated.


Regards,

~Mayuresh


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
Thanks. I think the problem is that the TEZ user group is exceptionally
quiet. I just sent an email to the Hive user group to see if anyone has managed to
build a vendor-independent version.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 May 2016 at 21:23, Jörn Franke  wrote:

> Well I think it is different from MR. It has some optimizations which you
> do not find in MR. Especially the LLAP option in Hive2 makes it
> interesting.
>
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is
> integrated in the Hortonworks distribution.
>
>
> On 29 May 2016, at 21:43, Mich Talebzadeh 
> wrote:
>
> Hi Jorn,
>
> I started building apache-tez-0.8.2 but got a few errors. A couple of guys
> from the TEZ user group kindly gave a hand but I could not get very far (or may
> be I did not make enough effort) making it work.
>
> That TEZ user group is very quiet as well.
>
> My understanding is TEZ is MR with DAG but of course Spark has both plus
> in-memory capability.
>
> It would be interesting to see what version of TEZ works as execution
> engine with Hive.
>
> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
> Hive etc as I am sure you already know.
>
> Cheers,
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>
>> Very interesting do you plan also a test with TEZ?
>>
>> On 29 May 2016, at 13:40, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I did another study of Hive using Spark engine compared to Hive with MR.
>>
>> Basically took the original table imported using Sqoop and created and
>> populated a new ORC table partitioned by year and month into 48 partitions
>> as follows:
>>
>> 
>> ​
>> Connections use JDBC via beeline. Now for each partition using MR it
>> takes an average of 17 minutes as seen below for each PARTITION..  Now that
>> is just an individual partition and there are 48 partitions.
>>
>> In contrast doing the same operation with Spark engine took 10 minutes
>> all inclusive. I just gave up on MR. You can see the StartTime and
>> FinishTime from below
>>
>> 
>>
> This by no means indicates that Spark is much better than MR, but it shows
> that some very good results can be achieved using the Spark engine.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 24 May 2016 at 08:03, Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>>
>> Whether Hive is the right database for the purpose, or one is better off with
>> something like Phoenix on Hbase, well the answer is it depends and your
>> mileage varies.
>>>
>>> So fit for purpose.
>>>
>> Ideally what one wants is to use the fastest method to get the results. How
>> fast is confined to our SLA agreements in production, and that keeps us
>> from unnecessary further work, as we technologists like to play around.
>>>
>>> So in short, we use Spark most of the time and use Hive as the backend
>>> engine for data storage, mainly ORC tables.
>>>
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a
>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but
>>> at the moment it is one of my projects.
>>>
>>> We do not use any vendor's products as it enables us to move away  from
>>> being tied down after years of SAP, Oracle and MS dependency to yet another
>>> vendor. Besides there is some politics going on with one promoting Tez and
>>> another Spark as a backend. That is fine but obviously we prefer an
>>> independent assessment ourselves.
>>>
>>> My gut feeling is that one needs to look at the use case. Recently we
>>> had to import a very large table from Oracle to Hive and decided to use
>>> Spark 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used
>>> JDBC connection with temp table and it was good. We could have used sqoop
>>> but decided to settle for Spark so it all depends on use case.
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Well I think it is different from MR. It has some optimizations which you do 
not find in MR. Especially the LLAP option in Hive2 makes it interesting. 

I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
integrated in the Hortonworks distribution. 


> On 29 May 2016, at 21:43, Mich Talebzadeh  wrote:
> 
> Hi Jorn,
> 
> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
> the TEZ user group kindly gave a hand but I could not get very far (or may be I 
> did not make enough effort) making it work.
> 
> That TEZ user group is very quiet as well.
> 
> My understanding is TEZ is MR with DAG but of course Spark has both plus 
> in-memory capability.
> 
> It would be interesting to see what version of TEZ works as execution engine 
> with Hive. 
> 
> Vendors are divided on this (use Hive with TEZ) or use Impala instead of Hive 
> etc as I am sure you already know.
> 
> Cheers,
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 24 May 2016 at 08:03, Mich Talebzadeh  wrote:
 Hi,
 
 We use Hive as the database and use Spark as an all purpose query tool.
 
 Whether Hive is the right database for the purpose, or one is better off with 
 something like Phoenix on Hbase, well the answer is it depends and your 
 mileage varies. 
 
 So fit for purpose.
 
 Ideally what one wants is to use the fastest method to get the results. How 
 fast is confined to our SLA agreements in production, and that keeps us 
 from unnecessary further work, as we technologists like to play around.
 
 So in short, we use Spark most of the time and use Hive as the backend 
 engine for data storage, mainly ORC tables.
 
 We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
 combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
 at the moment it is one of my projects.
 
 We do not use any vendor's products as it enables us to move away  from 
 being tied down after years of SAP, Oracle and MS dependency to yet 
 another vendor. Besides there is some politics going on with one promoting 
 Tez and another Spark as a backend. That is fine but obviously we prefer 
 an independent assessment ourselves.
 
 My gut feeling is that one needs to look at the use case. Recently we had 
 to import a very large table from Oracle to Hive and decided to use Spark 
 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC 
 connection with temp table and it was good. We could have used sqoop but 
 decided to settle for Spark so it all depends on use case.
 
 HTH
 
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
  
 
> On 24 May 2016 at 03:11, ayan guha  wrote:
> Hi
> 
> Thanks for very useful stats. 
> 
> Did you have any benchmark for using Spark as backend engine for Hive vs 
> using Spark thrift server (and run spark code for hive queries)? We are 
> using later but it will be very useful to remove thriftserver, if we can. 
> 
>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  
>> wrote:
>> 
>> Hi Mich,
>> 
>> I think these comparisons are useful. One interesting aspect could be 
>> hardware scalability in this context. Additionally different type of 
>> 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
Hi Jorn,

I started building apache-tez-0.8.2 but got a few errors. A couple of guys from
the TEZ user group kindly gave a hand but I could not get very far (or may be I
did not make enough effort) making it work.

That TEZ user group is very quiet as well.

My understanding is TEZ is MR with DAG but of course Spark has both plus
in-memory capability.

It would be interesting to see what version of TEZ works as execution
engine with Hive.

Vendors are divided on this (use Hive with TEZ) or use Impala instead of
Hive etc as I am sure you already know.

Cheers,




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 May 2016 at 20:19, Jörn Franke  wrote:

> Very interesting do you plan also a test with TEZ?
>
> On 29 May 2016, at 13:40, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I did another study of Hive using Spark engine compared to Hive with MR.
>
> Basically took the original table imported using Sqoop and created and
> populated a new ORC table partitioned by year and month into 48 partitions
> as follows:
>
> 
> ​
> Connections use JDBC via beeline. Now for each partition using MR it takes
> an average of 17 minutes as seen below for each PARTITION..  Now that is
> just an individual partition and there are 48 partitions.
>
> In contrast doing the same operation with Spark engine took 10 minutes all
> inclusive. I just gave up on MR. You can see the StartTime and FinishTime
> from below
>
> 
>
> This by no means indicates that Spark is much better than MR, but it shows
> that some very good results can be achieved using the Spark engine.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 24 May 2016 at 08:03, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> We use Hive as the database and use Spark as an all purpose query tool.
>>
>> Whether Hive is the right database for the purpose, or one is better off with
>> something like Phoenix on Hbase, well the answer is it depends and your
>> mileage varies.
>>
>> So fit for purpose.
>>
>> Ideally what one wants is to use the fastest method to get the results. How
>> fast is confined to our SLA agreements in production, and that keeps us
>> from unnecessary further work, as we technologists like to play around.
>>
>> So in short, we use Spark most of the time and use Hive as the backend
>> engine for data storage, mainly ORC tables.
>>
>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a
>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but
>> at the moment it is one of my projects.
>>
>> We do not use any vendor's products as it enables us to move away  from
>> being tied down after years of SAP, Oracle and MS dependency to yet another
>> vendor. Besides there is some politics going on with one promoting Tez and
>> another Spark as a backend. That is fine but obviously we prefer an
>> independent assessment ourselves.
>>
>> My gut feeling is that one needs to look at the use case. Recently we had
>> to import a very large table from Oracle to Hive and decided to use Spark
>> 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC
>> connection with temp table and it was good. We could have used sqoop but
>> decided to settle for Spark so it all depends on use case.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 24 May 2016 at 03:11, ayan guha  wrote:
>>
>>> Hi
>>>
>>> Thanks for very useful stats.
>>>
>>> Did you have any benchmark for using Spark as backend engine for Hive vs
>>> using Spark thrift server (and run spark code for hive queries)? We are
>>> using later but it will be very useful to remove thriftserver, if we can.
>>>
>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke 
>>> wrote:
>>>

 Hi Mich,

 I think these comparisons are useful. One interesting aspect could be
 hardware scalability in this context. Additionally different type of
 computations. Furthermore, one could compare Spark and Tez+llap as
 execution engines. I have the gut feeling that  each one can be justified
 by different use cases.
 Nevertheless, there should be always a disclaimer for such comparisons,
 because Spark and Hive are not good for a lot of concurrent lookups of
 single rows. They are not good for frequently write small amounts 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Very interesting do you plan also a test with TEZ?

> On 29 May 2016, at 13:40, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I did another study of Hive using Spark engine compared to Hive with MR.
> 
> Basically took the original table imported using Sqoop and created and 
> populated a new ORC table partitioned by year and month into 48 partitions as 
> follows:
> 
> 
> ​ 
> Connections use JDBC via beeline. Now for each partition using MR it takes an 
> average of 17 minutes as seen below for each PARTITION..  Now that is just an 
> individual partition and there are 48 partitions.
> 
> In contrast doing the same operation with Spark engine took 10 minutes all 
> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
> from below
> 
> 
> 
> This by no means indicates that Spark is much better than MR, but it shows that 
> some very good results can be achieved using the Spark engine.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 24 May 2016 at 08:03, Mich Talebzadeh  wrote:
>> Hi,
>> 
>> We use Hive as the database and use Spark as an all purpose query tool.
>> 
>> Whether Hive is the right database for the purpose, or one is better off with 
>> something like Phoenix on Hbase, well the answer is it depends and your 
>> mileage varies. 
>> 
>> So fit for purpose.
>> 
>> Ideally what one wants is to use the fastest method to get the results. How 
>> fast is confined to our SLA agreements in production, and that keeps us 
>> from unnecessary further work, as we technologists like to play around.
>> 
>> So in short, we use Spark most of the time and use Hive as the backend 
>> engine for data storage, mainly ORC tables.
>> 
>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but at 
>> the moment it is one of my projects.
>> 
>> We do not use any vendor's products as it enables us to move away  from 
>> being tied down after years of SAP, Oracle and MS dependency to yet another 
>> vendor. Besides there is some politics going on with one promoting Tez and 
>> another Spark as a backend. That is fine but obviously we prefer an 
>> independent assessment ourselves.
>> 
>> My gut feeling is that one needs to look at the use case. Recently we had to 
>> import a very large table from Oracle to Hive and decided to use Spark 1.6.1 
>> with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC 
>> connection with temp table and it was good. We could have used sqoop but 
>> decided to settle for Spark so it all depends on use case.
>> 
>> HTH
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 24 May 2016 at 03:11, ayan guha  wrote:
>>> Hi
>>> 
>>> Thanks for very useful stats. 
>>> 
>>> Did you have any benchmark for using Spark as backend engine for Hive vs 
>>> using Spark thrift server (and run spark code for hive queries)? We are 
>>> using later but it will be very useful to remove thriftserver, if we can. 
>>> 
 On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  wrote:
 
 Hi Mich,
 
 I think these comparisons are useful. One interesting aspect could be 
 hardware scalability in this context. Additionally different type of 
 computations. Furthermore, one could compare Spark and Tez+llap as 
 execution engines. I have the gut feeling that  each one can be justified 
 by different use cases.
 Nevertheless, there should be always a disclaimer for such comparisons, 
 because Spark and Hive are not good for a lot of concurrent lookups of 
 single rows. They are not good for frequently write small amounts of data 
 (eg sensor data). Here hbase could be more interesting. Other use cases 
 can justify graph databases, such as Titan, or text analytics/ data 
 matching using Solr on Hadoop.
 Finally, even if you have a lot of data you need to think if you always 
 have to process everything. For instance, I have found valid use cases in 
 practice where we decided to evaluate 10 machine learning models in 
 parallel on only a sample of data and only evaluate the "winning" model of 
 the total of data.
 
 As always it depends :) 
 
 Best regards
 
 P.s.: at least Hortonworks has in their distribution spark 1.5 with hive 
 1.2 and spark 1.6 with hive 1.2. Maybe they have somewhere described how 
 to manage bringing both together. You may check also Apache Bigtop (vendor 
 neutral distribution) on how they managed to bring both together.
 
> On 23 May 2016, at 01:42, Mich Talebzadeh  
> wrote:
> 

Re: GraphX Java API

2016-05-29 Thread Jules Damji
Also, this blog post talks about GraphFrames' implementation of some GraphX 
algorithms, accessible from Java, Scala, and Python: 

https://databricks.com/blog/2016/03/03/introducing-graphframes.html

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On May 29, 2016, at 12:24 AM, Takeshi Yamamuro  wrote:
> 
> Hi,
> 
> Have you checked GraphFrame?
> See the related discussion:
> https://issues.apache.org/jira/browse/SPARK-3665
> 
> // maropu
> 
>> On Fri, May 27, 2016 at 8:22 PM, Santoshakhilesh 
>>  wrote:
>> GraphX APIs are available only in Scala. If you need to use GraphX you need 
>> to switch to Scala.
>> 
>>  
>> 
>> From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com] 
>> Sent: 27 May 2016 19:59
>> To: user@spark.apache.org
>> Subject: GraphX Java API
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> We are trying to consume the Java API for GraphX, but there is no 
>> documentation available online on the usage or examples. It would be great 
>> if we could get some examples in Java.
>> 
>>  
>> 
>> Thanks and regards,
>> 
>>  
>> 
>> Abhishek Kumar
>> 
>> Products & Services | iLab
>> 
>> Deloitte Consulting LLP
>> 
>> Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post, 
>> Yemlur, Bengaluru – 560037, Karnataka, India
>> 
>> Mobile: +91 7736795770
>> 
>> abhishekkuma...@deloitte.com | www.deloitte.com
>> 
>>  
>> 
>> Please consider the environment before printing.
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>>  
>> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro


Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
Hi,

Have you checked GraphFrame?
See the related discussion:
https://issues.apache.org/jira/browse/SPARK-3665
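
For reference, a minimal GraphFrames sketch in Scala (assuming an existing
SQLContext and that the graphframes spark-package is on the classpath; the exact
package coordinates are an assumption here). The same DataFrame-based API is
reachable from Java:

import org.graphframes.GraphFrame

// vertices need an "id" column; edges need "src" and "dst" columns
val vertices = sqlContext.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Charlie")
)).toDF("id", "name")

val edges = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)
g.inDegrees.show()
g.pageRank.resetProbability(0.15).maxIter(10).run().vertices.show()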

// maropu

On Fri, May 27, 2016 at 8:22 PM, Santoshakhilesh <
santosh.akhil...@huawei.com> wrote:

> GraphX APIs are available only in Scala. If you need to use GraphX you
> need to switch to Scala.
>
>
>
> *From:* Kumar, Abhishek (US - Bengaluru) [mailto:
> abhishekkuma...@deloitte.com]
> *Sent:* 27 May 2016 19:59
> *To:* user@spark.apache.org
> *Subject:* GraphX Java API
>
>
>
> Hi,
>
>
>
> We are trying to consume the Java API for GraphX, but there is no
> documentation available online on the usage or examples. It would be great
> if we could get some examples in Java.
>
>
>
> Thanks and regards,
>
>
>
> *Abhishek Kumar*
>
> Products & Services | iLab
>
> Deloitte Consulting LLP
>
> Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post,
> Yemlur, Bengaluru – 560037, Karnataka, India
>
> Mobile: +91 7736795770
>
> abhishekkuma...@deloitte.com | www.deloitte.com
>
>
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>



-- 
---
Takeshi Yamamuro