Spark Streaming stateful operation to HBase

2016-06-08 Thread soumick dasgupta
Hi,

I am using mapWithState to keep the state and then output the result to
HBase. The problem I am facing is that when no files arrive, the RDD still
emits the previous state result due to the checkpoint. Is there a way to
avoid writing that result to HBase, i.e., when the RDD is emitting results
from the previous state only, they should not be printed/written?
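A minimal sketch of one way to approach this, assuming a Scala job shaped roughly like the one described; the state function, the (String, Int) key/value types, and the writeToHBase helper are placeholders rather than the original code. Writing from the mapped stream (which, unless a timeout is set, only carries keys that received data in the current batch) and skipping empty batches keeps untouched state out of HBase:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical state function: emits (key, runningCount) only for keys seen in this batch.
def trackCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

// writeToHBase stands in for whatever HBase put logic the job already has.
def sinkUpdates(keyedStream: DStream[(String, Int)],
                writeToHBase: ((String, Int)) => Unit): Unit = {
  val updates = keyedStream.mapWithState(StateSpec.function(trackCount _))
  updates.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {   // only write when this batch actually produced updates
      rdd.foreachPartition(_.foreach(writeToHBase))
    }
  }
}

By contrast, stateSnapshots() re-emits every key's state on every batch, which matches the behaviour described above.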

Thank You,

Soumick Dasgupta


Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Jasleen Kaur
The github repo is https://github.com/datastax/spark-cassandra-connector

The talk video and slides should be uploaded soon on spark summit website

On Wednesday, June 8, 2016, Chanh Le  wrote:

> Thanks, I'll look into it. Could you share the related link?
>
> On Thu, Jun 9, 2016, 12:43 PM Jasleen Kaur  > wrote:
>
>> Try using the datastax package. There was a great talk on spark summit
>> about it. It will take care of the boiler plate code and you can focus on
>> real business value
>>
>> On Wednesday, June 8, 2016, Chanh Le > > wrote:
>>
>>> Hi everyone,
>>> I tested partitioning a DataFrame by columns, but the result looks wrong.
>>> I am using Spark 1.6.1 and loading data from Cassandra.
>>> Repartitioning by 2 fields (date, network_id) gives 200 partitions.
>>> Repartitioning by 1 field (date) also gives 200 partitions,
>>> but my data covers 90 days, so repartitioning by date should give
>>> 90 partitions.
>>>
>>> val daily = sql
>>>   .read
>>>   .format("org.apache.spark.sql.cassandra")
>>>   .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
>>>   .load()
>>>   .repartition(col("date"))
>>>
>>>
>>>
>>> Whatever columns I pass to repartition, the partition count doesn't change.
>>>
>>> Does anyone have the same problem?
>>>
>>> Thanks in advance.
>>>
>>


Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Chanh Le
Thanks, I'll look into it. Could you share the related link?

On Thu, Jun 9, 2016, 12:43 PM Jasleen Kaur 
wrote:

> Try using the datastax package. There was a great talk on spark summit
> about it. It will take care of the boiler plate code and you can focus on
> real business value
>
> On Wednesday, June 8, 2016, Chanh Le  wrote:
>
>> Hi everyone,
>> I tested partitioning a DataFrame by columns, but the result looks wrong.
>> I am using Spark 1.6.1 and loading data from Cassandra.
>> Repartitioning by 2 fields (date, network_id) gives 200 partitions.
>> Repartitioning by 1 field (date) also gives 200 partitions,
>> but my data covers 90 days, so repartitioning by date should give
>> 90 partitions.
>>
>> val daily = sql
>>   .read
>>   .format("org.apache.spark.sql.cassandra")
>>   .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
>>   .load()
>>   .repartition(col("date"))
>>
>>
>>
>> Whatever columns I pass to repartition, the partition count doesn't change.
>>
>> Does anyone have the same problem?
>>
>> Thanks in advance.
>>
>


Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Jasleen Kaur
Try using the datastax package. There was a great talk on spark summit
about it. It will take care of the boiler plate code and you can focus on
real business value

On Wednesday, June 8, 2016, Chanh Le  wrote:

> Hi everyone,
> I tested partitioning a DataFrame by columns, but the result looks wrong.
> I am using Spark 1.6.1 and loading data from Cassandra.
> Repartitioning by 2 fields (date, network_id) gives 200 partitions.
> Repartitioning by 1 field (date) also gives 200 partitions,
> but my data covers 90 days, so repartitioning by date should give
> 90 partitions.
>
> val daily = sql
>   .read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
>   .load()
>   .repartition(col("date"))
>
>
>
> Whatever columns I pass to repartition, the partition count doesn't change.
>
> Does anyone have the same problem?
>
> Thanks in advance.
>


Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.

Thank you.
Daniel

> On 9 Jun 2016, at 01:11, Steve Loughran  wrote:
> 
> 
>> On 8 Jun 2016, at 16:34, Daniel Haviv  
>> wrote:
>> 
>> Hi,
>> I'm trying to create a table on s3a but I keep hitting the following error:
>> Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
>> MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: 
>> Unable to load AWS credentials from any provider in the chain)
>>  
>> I tried setting the s3a keys using the configuration object but I might be 
>> hitting SPARK-11364 :
>> conf.set("fs.s3a.access.key", accessKey)
>> conf.set("fs.s3a.secret.key", secretKey)
>> conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
>> conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
>> val sc = new SparkContext(conf)
>>  
>> I tried setting these properties in hdfs-site.xml but I'm still getting this
>> error.
> 
> 
> 
> Try core-site.xml rather than hdfs-site.xml; the latter only gets loaded when
> an HdfsConfiguration() instance is created, and it may be a bit too late.
> 
>> Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY 
>> environment variables but with no luck.
> 
> Those env vars aren't picked up directly by S3a (well, that was fixed over 
> the weekend https://issues.apache.org/jira/browse/HADOOP-12807  ); There's 
> some fixup in spark ( see 
> SparkHadoopUtil.appendS3AndSparkHadoopConfigurations() ); I don't know if 
> that is a factor; 
> 
>> Any ideas on how to resolve this issue ?
>>  
>> Thank you.
>> Daniel
> 


Spark Partition by Columns doesn't work properly

2016-06-08 Thread Chanh Le
Hi everyone,
I tested partitioning a DataFrame by columns, but the result looks wrong.
I am using Spark 1.6.1 and loading data from Cassandra.
Repartitioning by 2 fields (date, network_id) gives 200 partitions.
Repartitioning by 1 field (date) also gives 200 partitions,
but my data covers 90 days, so repartitioning by date should give 90
partitions.
val daily = sql
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
  .load()
  .repartition(col("date"))


Whatever columns I pass to repartition, the partition count doesn't change.

Does anyone have the same problem?

Thanks in advance.
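A likely explanation, worth verifying, is that repartition(partitionExprs) without an explicit count falls back to spark.sql.shuffle.partitions (200 by default), so the columns only control which partition a row is routed to, not how many partitions exist. A minimal sketch, reusing the daily DataFrame and the sql (SQLContext) variable from the snippet above, of pinning the count explicitly:

import org.apache.spark.sql.functions.col

// Hash-partition by date into an explicit number of partitions; 90 matches the 90 days of data.
val dailyByDate = daily.repartition(90, col("date"))

// Alternatively, lower the default that a column-only repartition(...) falls back to.
sql.setConf("spark.sql.shuffle.partitions", "90")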

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
You can try passing in an explicit encoder:
org.apache.spark.sql.Encoders.kryo[Set[com.wix.accord.Violation]]

Although this might only be available in Spark 2; I don't remember off the
top of my head...
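For illustration, a sketch of a coarser variant of that fallback: kryo-serializing the whole record so that no field-by-field encoder has to be derived at all. The case class below is only a stand-in for the one in the stack trace, and Encoders.kryo / createDataset should be checked against the exact 1.6.1 API.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoders, SQLContext}

// Stand-in for the real class; the point is the Set-typed field that the
// reflection-based encoder rejects.
case class ProbeValidation(id: Long, violations: Set[String])

def toDataset(sqlContext: SQLContext, rdd: RDD[ProbeValidation]): Dataset[ProbeValidation] = {
  // Kryo-serialize whole objects instead of deriving a column-per-field encoder.
  sqlContext.createDataset(rdd)(Encoders.kryo[ProbeValidation])
}

The trade-off is that a kryo-encoded Dataset is stored as an opaque binary column, so Catalyst cannot prune or filter on individual fields.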

On Wed, Jun 8, 2016 at 11:57 PM, Koert Kuipers  wrote:

> Sets are not supported. You basically need to stick to products (tuples,
> case classes), Seq and Map (and in Spark 2 also Option).
>
> Otherwise you need to resort to the kryo-based encoder.
>
> On Wed, Jun 8, 2016 at 3:45 PM, Peter Halliday  wrote:
>
>> I have some RDD-based code that was producing OOMs during shuffle. So,
>> on the advice of a member of Databricks, I started converting to Datasets.
>> However, when we did, we got an error that seems to dislike
>> something within one of our case classes.
>>
>> Peter Halliday
>>
>>
>> [2016-06-08 19:12:22,083] ERROR
>> org.apache.spark.deploy.yarn.ApplicationMaster [Driverhread] - User class
>> threw exception: java.lang.UnsupportedOperationException: No Encoder found
>> for Set[com.wix.accord.Violation]
>> - field (class: "scala.collection.immutable.Set", name: "violations")
>> - root class: "com.here.probe.ingestion.converters.ProbeValidation"
>> java.lang.UnsupportedOperationException: No Encoder found for
>> Set[com.wix.accord.Violation]
>> - field (class: "scala.collection.immutable.Set", name: "violations")
>> - root class: "com.here.probe.ingestion.converters.ProbeValidation"
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:594)
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:494)
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:490)
>> at
>> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>> at
>> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:490)
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.extractorsFor(ScalaReflection.scala:402)
>> at
>> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:54)
>> at
>> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
>> at
>> com.here.probe.ingestion.IngestProbe.processLines(IngestProbe.scala:116)
>> at com.here.probe.ingestion.IngestProbe.processFiles(IngestProbe.scala:86)
>> at com.here.probe.ingestion.IngestProbe$.main(IngestProbe.scala:53)
>> at com.here.probe.ingestion.IngestProbe.main(IngestProbe.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>> [2016-06-08 19:12:22,086] INFO
>> org.apache.spark.deploy.yarn.ApplicationMaster [Driverhread] - Final app
>> status: FAILED, exitCode: 15, (reason: User class threw exception:
>> java.lang.UnsupportedOperationException: No Encoder found for
>> Set[com.wix.accord.Violation]
>> - field (class: "scala.collection.immutable.Set", name: "violations")
>> - root class: "com.here.probe.ingestion.converters.ProbeValidation”)
>>
>
>


Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
Sets are not supported. You basically need to stick to products (tuples,
case classes), Seq and Map (and in Spark 2 also Option).

Otherwise you need to resort to the kryo-based encoder.

On Wed, Jun 8, 2016 at 3:45 PM, Peter Halliday  wrote:

> I have some RDD-based code that was producing OOMs during shuffle. So,
> on the advice of a member of Databricks, I started converting to Datasets.
> However, when we did, we got an error that seems to dislike
> something within one of our case classes.
>
> Peter Halliday
>
>
> [2016-06-08 19:12:22,083] ERROR
> org.apache.spark.deploy.yarn.ApplicationMaster [Driverhread] - User class
> threw exception: java.lang.UnsupportedOperationException: No Encoder found
> for Set[com.wix.accord.Violation]
> - field (class: "scala.collection.immutable.Set", name: "violations")
> - root class: "com.here.probe.ingestion.converters.ProbeValidation"
> java.lang.UnsupportedOperationException: No Encoder found for
> Set[com.wix.accord.Violation]
> - field (class: "scala.collection.immutable.Set", name: "violations")
> - root class: "com.here.probe.ingestion.converters.ProbeValidation"
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:594)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:494)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:490)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:490)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.extractorsFor(ScalaReflection.scala:402)
> at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:54)
> at
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
> at com.here.probe.ingestion.IngestProbe.processLines(IngestProbe.scala:116)
> at com.here.probe.ingestion.IngestProbe.processFiles(IngestProbe.scala:86)
> at com.here.probe.ingestion.IngestProbe$.main(IngestProbe.scala:53)
> at com.here.probe.ingestion.IngestProbe.main(IngestProbe.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> [2016-06-08 19:12:22,086] INFO
> org.apache.spark.deploy.yarn.ApplicationMaster [Driverhread] - Final app
> status: FAILED, exitCode: 15, (reason: User class threw exception:
> java.lang.UnsupportedOperationException: No Encoder found for
> Set[com.wix.accord.Violation]
> - field (class: "scala.collection.immutable.Set", name: "violations")
> - root class: "com.here.probe.ingestion.converters.ProbeValidation”)
>


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
The other way is to log in to the individual nodes and do

 jps

24819 Worker

and look for the processes identified as Worker.

Also, you can use jmonitor to see what they are doing resource-wise.

You can of course write a small shell script to check whether the Worker(s) are up and
running on every node and alert if they are down (a rough sketch of such a check follows below).
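On the Web API question raised earlier in the thread: besides the HTML page on port 8080, the standalone master commonly serves the same status as JSON at http://<master-host>:8080/json; treat that endpoint and the field it greps for as assumptions to verify against your Spark version. A rough Scala sketch of such a check:

import scala.io.Source

object StandaloneWorkerCheck {
  def main(args: Array[String]): Unit = {
    // Assumed endpoint: the standalone master's status page rendered as JSON.
    val masterJson = args.headOption.getOrElse("http://localhost:8080/json")
    val body = Source.fromURL(masterJson).mkString
    // Crude check: count worker entries reported as ALIVE in the JSON text.
    val alive = "\"state\"\\s*:\\s*\"ALIVE\"".r.findAllIn(body).length
    if (alive == 0) System.err.println(s"No ALIVE workers reported by $masterJson")
    else println(s"$alive worker(s) reported ALIVE")
  }
}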

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 01:27, Rutuja Kulkarni 
wrote:

> Thank you for the quick response.
> So the workers section would list all the running worker nodes in the
> standalone Spark cluster?
> I was also wondering if this is the only way to retrieve worker nodes or
> is there something like a Web API or CLI I could use?
> Thanks.
>
> Regards,
> Rutuja
>
> On Wed, Jun 8, 2016 at 4:02 PM, Mich Talebzadeh  > wrote:
>
>> check port 8080 on the node that you started start-master.sh
>>
>>
>>
>> [image: Inline images 2]
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 8 June 2016 at 23:56, Rutuja Kulkarni 
>> wrote:
>>
>>> Hello!
>>>
>>> I'm trying to set up a standalone Spark cluster and wondering how to
>>> track the status of all of its nodes. I wonder if something like the YARN REST API
>>> or the HDFS CLI exists in the Spark world that can provide the status of nodes on such
>>> a cluster. Any pointers would be greatly appreciated.
>>>
>>> --
>>> *Regards,*
>>> *Rutuja Kulkarni*
>>>
>>>
>>>
>>
>
>
> --
> *Regards,*
> *Rutuja Kulkarni*
>
>
>


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Rutuja Kulkarni
Thank you for the quick response.
So the workers section would list all the running worker nodes in the
standalone Spark cluster?
I was also wondering if this is the only way to retrieve worker nodes or is
there something like a Web API or CLI I could use?
Thanks.

Regards,
Rutuja

On Wed, Jun 8, 2016 at 4:02 PM, Mich Talebzadeh 
wrote:

> check port 8080 on the node that you started start-master.sh
>
>
>
> [image: Inline images 2]
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 23:56, Rutuja Kulkarni 
> wrote:
>
>> Hello!
>>
>> I'm trying to set up a standalone Spark cluster and wondering how to track
>> the status of all of its nodes. I wonder if something like the YARN REST API or
>> the HDFS CLI exists in the Spark world that can provide the status of nodes on such a
>> cluster. Any pointers would be greatly appreciated.
>>
>> --
>> *Regards,*
>> *Rutuja Kulkarni*
>>
>>
>>
>


-- 
*Regards,*
*Rutuja Kulkarni*


Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi,

Just to clarify, I use Hive with the Spark engine (my default), i.e. Hive on
Spark, as we discussed and observed.

Now, with regard to Spark (as an application, NOT as the execution engine)
creating the table in Hive and populating it, I don't think Spark itself does
any transactional enforcement. This means that Spark assumes no concurrency
on the Hive table. That is probably also the reason why updates/deletes to Hive
ORC transactional tables through Spark fail.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 00:46, Eugene Koifman  wrote:

> Locks in Hive are acquired by the query complier and should be independent
> of the execution engine.
> Having said that, I’ve not tried this on Spark, so my answer is only
> accurate with Hive.
>
> Eugene
>
>
> From: Michael Segel 
> Reply-To: "u...@hive.apache.org" 
> Date: Wednesday, June 8, 2016 at 3:42 PM
> To: "u...@hive.apache.org" 
> Cc: David Newberger , "user @spark" <
> user@spark.apache.org>
> Subject: Re: Creating a Hive table through Spark and potential locking
> issue (a bug)
>
>
> On Jun 8, 2016, at 3:35 PM, Eugene Koifman 
> wrote:
>
> if you split “create table test.dummy as select * from oraclehadoop.dummy;
> ”
> into create table statement, followed by insert into test.dummy as select…
> you should see the behavior you expect with Hive.
> Drop statement will block while insert is running.
>
> Eugene
>
>
> OK, assuming true…
>
> Then the ddl statement is blocked because Hive sees the table in use.
>
> If you can confirm this to be the case, and if you can confirm that with Spark
> you can still drop the table while the Spark job is running, then you would
> have a bug, since Spark in the Hive context either doesn't set any locks or
> sets them improperly.
>
> I would also have to ask which version of Hive you built Spark against.
> That could be another factor.
>
> HTH
>
> -Mike
>
>
>


Spark 2.0 Streaming and Event Time

2016-06-08 Thread Chang Lim
Hi All,

Does Spark 2.0 Streaming [sqlContext.read.format(...).stream(...)] support
event time? In TD's Spark Summit talk yesterday, this was listed as a 2.0
feature. If so, where is the API, and how do I set it?

Thanks in advance,
Chang
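For what it's worth, a sketch of how event time surfaces in the 2.0 APIs: there is no separate switch; you aggregate on a timestamp column using the window() function. The DataFrame and column names below are illustrative only and should be checked against the final 2.0 documentation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// events is assumed to be a streaming DataFrame with an "eventTime" timestamp column.
def countsByEventTime(events: DataFrame): DataFrame =
  events
    .groupBy(window(col("eventTime"), "10 minutes"))   // event-time window, not processing time
    .count()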



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-Streaming-and-Event-Time-tp27120.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
check port 8080 on the node that you started start-master.sh



[image: Inline images 2]

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:56, Rutuja Kulkarni 
wrote:

> Hello!
>
> I'm trying to set up a standalone Spark cluster and wondering how to track
> the status of all of its nodes. I wonder if something like the YARN REST API or
> the HDFS CLI exists in the Spark world that can provide the status of nodes on such a
> cluster. Any pointers would be greatly appreciated.
>
> --
> *Regards,*
> *Rutuja Kulkarni*
>
>
>


[ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Rutuja Kulkarni
Hello!

I'm trying to set up a standalone Spark cluster and wondering how to track
the status of all of its nodes. I wonder if something like the YARN REST API or
the HDFS CLI exists in the Spark world that can provide the status of nodes on such a
cluster. Any pointers would be greatly appreciated.

-- 
*Regards,*
*Rutuja Kulkarni*


Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
OK this seems to work.


   1. Create the target table first
   2.  Populate afterwards

 I first created the target table with

hive> create table test.dummy as select * from oraclehadoop.dummy where 1 =
2;

Then I did the INSERT/SELECT and tried to drop the target table while the DML
(INSERT/SELECT) was running.

Now process 6856 (the drop table ..) is waiting for the locks to be
released, which is correct:


Lock ID  Database      Table  Partition  State     Type         Transaction ID  Last Heartbeat  Acquired At    User    Hostname
6855     test          dummy  NULL       ACQUIRED  SHARED_READ  NULL            1465425703092   1465425703054  hduser  rhes564
6855     oraclehadoop  dummy  NULL       ACQUIRED  SHARED_READ  NULL            1465425703092   1465425703056  hduser  rhes564
6856     test          dummy  NULL       WAITING   EXCLUSIVE    NULL            1465425820073   NULL           hduser  rhes564

So it sounds like, with Hive, the issue is with the DDL + DML locks applied in a
single statement, i.e. create table A as select * from B.
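For completeness, a sketch of the same split issued from a Spark application through a HiveContext, reusing the table names from this thread; whether Spark actually acquires the corresponding Hive locks for each statement is exactly the open question here.

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)   // sc: an existing SparkContext

hc.sql("CREATE TABLE test.dummy AS SELECT * FROM oraclehadoop.dummy WHERE 1 = 2")
hc.sql("INSERT INTO TABLE test.dummy SELECT * FROM oraclehadoop.dummy")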

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:35, Eugene Koifman  wrote:

> if you split “create table test.dummy as select * from oraclehadoop.dummy;
> ”
> into create table statement, followed by insert into test.dummy as select…
> you should see the behavior you expect with Hive.
> Drop statement will block while insert is running.
>
> Eugene
>
> From: Mich Talebzadeh 
> Reply-To: "u...@hive.apache.org" 
> Date: Wednesday, June 8, 2016 at 3:12 PM
> To: Michael Segel 
> Cc: David Newberger , "u...@hive.apache.org"
> , "user @spark" 
> Subject: Re: Creating a Hive table through Spark and potential locking
> issue (a bug)
>
> Hive version is 2
>
> We can discuss all sorts of scenarios. However, Hive is pretty good at
> applying the locks at both the table and the partition level. The idea of
> having the metadata is to enforce these rules.
>
> [image: Inline images 1]
>
> For example above inserting from source to target table partitioned (year,
> month) shows that locks are applied correctly
>
> This is Hive running on Spark engine. The crucial point is that Hive
> accesses its metadata and updates its hive_locks table. Again one can see
> from data held in that table in metadata
>
> [image: Inline images 2]
>
> So I think there is a genuine issue here
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 22:36, Michael Segel  wrote:
>
>> Hi,
>>
>> Lets take a step back…
>>
>> Which version of Hive?
>>
>> Hive recently added transaction support so you have to know your
>> isolation level.
>>
>> Also are you running spark as your execution engine, or are you talking
>> about a spark app running w a hive context and then you drop the table from
>> within a Hive shell while the spark app is still running?
>>
>> You also have two different things happening… you’re mixing a DDL with a
>> query.  How does hive know you have another app reading from the table?
>> I mean what happens when you try a select * from foo; and in another
>> shell try dropping foo?  and if you want to simulate a m/r job add
>> something like an order by 1 clause.
>>
>> HTH
>>
>> -Mike
>>
>>
>>
>> On Jun 8, 2016, at 1:44 PM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> The idea of accessing Hive metada is to be aware of concurrency.
>>
>>
>>  In generall if I do the following In Hive
>>
>>
>> hive> create table test.dummy as select * from oraclehadoop.dummy;
>>
>>
>> We can see that hive applies the locks in Hive
>>
>>
>> 
>>
>>
>>
>>
>>
>> However, there seems to be an issue. *I do not see any exclusive lock on
>> the target table* (i.e. test.dummy). The locking type SHARED_READ on
>> source table oraclehadoop.dummy looks OK
>>
>>
>>  One can see the locks  in Hive database
>>
>>
>>
>>
>> 
>>
>>
>>
>>
>> So there are few issues here:
>>
>>
>>1. With Hive -> The source table is locked as SHARED_READ
>>2. With Spark --> No locks at all
>>3. With HIVE --> No locks on the target table
>>4. With Spark --> No locks at all
>>
>>  HTH
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 8 June 2016 at 20:22, 

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel

> On Jun 8, 2016, at 3:35 PM, Eugene Koifman  wrote:
> 
> if you split “create table test.dummy as select * from oraclehadoop.dummy;”
> into create table statement, followed by insert into test.dummy as select… 
> you should see the behavior you expect with Hive.
> Drop statement will block while insert is running.
> 
> Eugene
> 

OK, assuming that's true…

Then the DDL statement is blocked because Hive sees the table in use.

If you can confirm this to be the case, and if you can confirm that with Spark
you can still drop the table while the Spark job is running, then you would
have a bug, since Spark in the Hive context either doesn't set any locks or
sets them improperly.

I would also have to ask which version of Hive you built Spark against.
That could be another factor.

HTH

-Mike




Re: Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Is there any specific reason why this feature is only supported in
streaming?

On Wed, Jun 8, 2016 at 3:24 PM, Ted Yu  wrote:

> There was a minor typo in the name of the config:
>
> spark.streaming.receiver.writeAheadLog.enable
>
> Yes, it only applies to Streaming.
>
> On Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia 
> wrote:
>
>> Is something similar to park.streaming.receiver.writeAheadLog.enable
>> available on SparkContext? It looks like it only works for spark streaming.
>>
>
>


Re: Write Ahead Log

2016-06-08 Thread Ted Yu
There was a minor typo in the name of the config:

spark.streaming.receiver.writeAheadLog.enable

Yes, it only applies to Streaming.
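For reference, a minimal sketch of where the (correctly spelled) property applies, i.e. a receiver-based Spark Streaming job; the app name, batch interval and checkpoint path are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The receiver write-ahead log is stored under the checkpoint directory, so one is required.
ssc.checkpoint("hdfs:///tmp/wal-demo-checkpoint")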

On Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia 
wrote:

> Is something similar to park.streaming.receiver.writeAheadLog.enable
> available on SparkContext? It looks like it only works for spark streaming.
>


Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Is something similar to park.streaming.receiver.writeAheadLog.enable
available on SparkContext? It looks like it only works for spark streaming.


Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hive version is 2.

We can discuss all sorts of scenarios. However, Hive is pretty good at
applying the locks at both the table and the partition level. The idea of
having the metadata is to enforce these rules.

[image: Inline images 1]

For example, the insert above from the source table to a target table partitioned by (year,
month) shows that the locks are applied correctly.

This is Hive running on the Spark engine. The crucial point is that Hive
accesses its metastore and updates its HIVE_LOCKS table. Again, one can see
this from the data held in that table in the metastore.

[image: Inline images 2]

So I think there is a genuine issue here

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 22:36, Michael Segel  wrote:

> Hi,
>
> Lets take a step back…
>
> Which version of Hive?
>
> Hive recently added transaction support so you have to know your isolation
> level.
>
> Also are you running spark as your execution engine, or are you talking
> about a spark app running w a hive context and then you drop the table from
> within a Hive shell while the spark app is still running?
>
> You also have two different things happening… you’re mixing a DDL with a
> query.  How does hive know you have another app reading from the table?
> I mean what happens when you try a select * from foo; and in another shell
> try dropping foo?  and if you want to simulate a m/r job add something like
> an order by 1 clause.
>
> HTH
>
> -Mike
>
>
>
> On Jun 8, 2016, at 1:44 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> The idea of accessing Hive metada is to be aware of concurrency.
>
>
>  In generall if I do the following In Hive
>
>
> hive> create table test.dummy as select * from oraclehadoop.dummy;
>
>
> We can see that hive applies the locks in Hive
>
>
> 
>
>
>
>
>
> However, there seems to be an issue. *I do not see any exclusive lock on
> the target table* (i.e. test.dummy). The locking type SHARED_READ on
> source table oraclehadoop.dummy looks OK
>
>
>  One can see the locks  in Hive database
>
>
>
>
> 
>
>
>
>
> So there are few issues here:
>
>
>1. With Hive -> The source table is locked as SHARED_READ
>2. With Spark --> No locks at all
>3. With HIVE --> No locks on the target table
>4. With Spark --> No locks at all
>
>  HTH
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 20:22, David Newberger 
> wrote:
>
>> Could you be looking at 2 jobs trying to use the same file and one
>> getting to it before the other and finally removing it?
>>
>>
>>
>> *David Newberger*
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>> *Sent:* Wednesday, June 8, 2016 1:33 PM
>> *To:* user; user @spark
>> *Subject:* Creating a Hive table through Spark and potential locking
>> issue (a bug)
>>
>>
>>
>>
>>
>> Hi,
>>
>>
>>
>> I noticed an issue with Spark creating and populating a Hive table.
>>
>>
>>
>> The process as I see is as follows:
>>
>>
>>
>>1. Spark creates the Hive table. In this case an ORC table in a Hive
>>Database
>>2. Spark uses JDBC connection to get data out from an Oracle
>>3. I create a temp table in Spark through (registerTempTable)
>>4. Spark populates that table. That table is actually created in
>>
>>hdfs dfs -ls /tmp/hive/hduser
>>
>>drwx--   - hduser supergroup
>>
>>/tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059
>>
>>
>>
>>
>>
>>1. However, The original table itself does not have any locking on it!
>>2. I log in into Hive and drop that table
>>
>> 3. hive> drop table dummy;
>>
>> OK
>>
>>
>>
>>1.  That table is dropped OK
>>2. Spark crashes with message
>>
>> Started at
>> [08/06/2016 18:37:53.53]
>> 16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
>> 1)
>>
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-1/_temporary/0/_temporary/attempt_201606081838_0001_m_00_0/part-0
>> (inode 831621): File does not exist. Holder
>> DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
>> at
>> 

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Steve Loughran

On 8 Jun 2016, at 16:34, Daniel Haviv 
> wrote:

Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable 
to load AWS credentials from any provider in the chain)



I tried setting the s3a keys using the configuration object but I might be 
hitting SPARK-11364 :

conf.set("fs.s3a.access.key", accessKey)
conf.set("fs.s3a.secret.key", secretKey)
conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)

val sc = new SparkContext(conf)



I tried setting these properties in hdfs-site.xml but I'm still getting this
error.



Try core-site.xml rather than hdfs-site.xml; the latter only gets loaded when
an HdfsConfiguration() instance is created, and it may be a bit too late.
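Another thing that may be worth a try (a sketch only, reusing the placeholder variable names from the snippet above, not a confirmed fix): set the keys on the live Hadoop configuration of the SparkContext itself, so they reach s3a regardless of which *-site.xml gets loaded.

// accessKey and secretKey are the same placeholders as in the snippet above.
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)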

Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY environment 
variables but with no luck.




Those env vars aren't picked up directly by s3a (well, that was fixed over the
weekend: https://issues.apache.org/jira/browse/HADOOP-12807). There's some
fixup in Spark (see SparkHadoopUtil.appendS3AndSparkHadoopConfigurations());
I don't know if that is a factor.

Any ideas on how to resolve this issue ?



Thank you.
Daniel




Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
Doh! It would help if I used the right email address to send this to the list…


Hi, 

Lets take a step back… 

Which version of Hive? 

Hive recently added transaction support so you have to know your isolation 
level. 

Also are you running spark as your execution engine, or are you talking about a 
spark app running w a hive context and then you drop the table from within a 
Hive shell while the spark app is still running? 

And you also have two different things happening… you’re mixing a DDL with a 
query.  How does hive know you have another app reading from the table? 
I mean what happens when you try a select * from foo; and in another shell try 
dropping foo?  and if you want to simulate a m/r job add something like an 
order by 1 clause. 

HTH

-Mike
> On Jun 8, 2016, at 2:36 PM, Michael Segel  wrote:
> 
> Hi, 
> 
> Lets take a step back… 
> 
> Which version of Hive? 
> 
> Hive recently added transaction support so you have to know your isolation 
> level. 
> 
> Also are you running spark as your execution engine, or are you talking about 
> a spark app running w a hive context and then you drop the table from within 
> a Hive shell while the spark app is still running? 
> 
> You also have two different things happening… you’re mixing a DDL with a 
> query.  How does hive know you have another app reading from the table? 
> I mean what happens when you try a select * from foo; and in another shell 
> try dropping foo?  and if you want to simulate a m/r job add something like 
> an order by 1 clause. 
> 
> HTH
> 
> -Mike



Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi,


The idea of accessing Hive metada is to be aware of concurrency.



 In generall if I do the following In Hive



hive> create table test.dummy as select * from oraclehadoop.dummy;



We can see that hive applies the locks in Hive



[image: Inline images 2]




However, there seems to be an issue. *I do not see any exclusive lock on
the target table* (i.e. test.dummy). The locking type SHARED_READ on source
table oraclehadoop.dummy looks OK



 One can see the locks  in Hive database



[image: Inline images 1]



So there are few issues here:


   1. With Hive -> The source table is locked as SHARED_READ
   2. With Spark --> No locks at all
   3. With HIVE --> No locks on the target table
   4. With Spark --> No locks at all

 HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 20:22, David Newberger 
wrote:

> Could you be looking at 2 jobs trying to use the same file and one getting
> to it before the other and finally removing it?
>
>
>
> *David Newberger*
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Wednesday, June 8, 2016 1:33 PM
> *To:* user; user @spark
> *Subject:* Creating a Hive table through Spark and potential locking
> issue (a bug)
>
>
>
>
>
> Hi,
>
>
>
> I noticed an issue with Spark creating and populating a Hive table.
>
>
>
> The process as I see is as follows:
>
>
>
>1. Spark creates the Hive table. In this case an ORC table in a Hive
>Database
>2. Spark uses JDBC connection to get data out from an Oracle
>3. I create a temp table in Spark through (registerTempTable)
>4. Spark populates that table. That table is actually created in
>
>hdfs dfs -ls /tmp/hive/hduser
>
>drwx--   - hduser supergroup
>
>/tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059
>
>
>
>
>
>1. However, The original table itself does not have any locking on it!
>2. I log in into Hive and drop that table
>
> 3. hive> drop table dummy;
>
> OK
>
>
>
>1.  That table is dropped OK
>2. Spark crashes with message
>
> Started at
> [08/06/2016 18:37:53.53]
> 16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
> 1)
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-1/_temporary/0/_temporary/attempt_201606081838_0001_m_00_0/part-0
> (inode 831621): File does not exist. Holder
> DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3169)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   

Variable in UpdateStateByKey Not Updating After Restarting Application from Checkpoint

2016-06-08 Thread Joe Panciera
I've run into an issue where a global variable used within an
UpdateStateByKey function isn't being assigned after the application
restarts from a checkpoint. Using ForEachRDD I have a global variable 'A'
that is propagated from a file every time a batch runs, and A is then used
in an UpdateStateByKey. When I initially run the application, it functions
as expected and the value of A is referenced correctly within the scope of
the update function.

However, when I bring the application down and restart, I see a different
behavior. Variable A is assigned the correct value by its corresponding
ForEachRDD function, but when the UpdateStateByKey function is executed the
new value for A isn't used. It just... disappears.

I could be going about the implementation of this wrong, but I'm hoping
that someone can point me in the correct direction.

Here's some pseudocode:

def readfile(rdd):
    global A
    A = readFromFile()          # pseudocode: reload the reference data on every batch

def update(new, old):
    if old in A:
        # do something
        pass


dstream.foreachRDD(readfile)
dstream.updateStateByKey(update)

ssc.checkpoint('checkpoint')

A is correct the first time this is run, but when the application is killed
and restarted A doesn't seem to be reassigned correctly.


UDTRegistration

2016-06-08 Thread pgrandjean
Hi, I discovered the following scala object on the master branch:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UDTRegistration.scala

It is currently a private object. In which Apache Spark version is it
planned to be released as public?

Thanks!
Patrick. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/UDTRegistration-tp27119.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Peter Halliday
I have some RDD-based code that was producing OOMs during shuffle. So, on the
advice of a member of Databricks, I started converting to Datasets. However,
when we did, we got an error that seems to dislike something within one of our
case classes.

Peter Halliday


[2016-06-08 19:12:22,083] ERROR org.apache.spark.deploy.yarn.ApplicationMaster 
[Driverhread] - User class threw exception: 
java.lang.UnsupportedOperationException: No Encoder found for 
Set[com.wix.accord.Violation]
- field (class: "scala.collection.immutable.Set", name: "violations")
- root class: "com.here.probe.ingestion.converters.ProbeValidation"
java.lang.UnsupportedOperationException: No Encoder found for 
Set[com.wix.accord.Violation]
- field (class: "scala.collection.immutable.Set", name: "violations")
- root class: "com.here.probe.ingestion.converters.ProbeValidation"
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:594)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:494)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:490)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:490)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.extractorsFor(ScalaReflection.scala:402)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:54)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
com.here.probe.ingestion.IngestProbe.processLines(IngestProbe.scala:116)
at 
com.here.probe.ingestion.IngestProbe.processFiles(IngestProbe.scala:86)
at com.here.probe.ingestion.IngestProbe$.main(IngestProbe.scala:53)
at com.here.probe.ingestion.IngestProbe.main(IngestProbe.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
[2016-06-08 19:12:22,086] INFO org.apache.spark.deploy.yarn.ApplicationMaster 
[Driverhread] - Final app status: FAILED, exitCode: 15, (reason: User class 
threw exception: java.lang.UnsupportedOperationException: No Encoder found for 
Set[com.wix.accord.Violation]
- field (class: "scala.collection.immutable.Set", name: "violations")
- root class: "com.here.probe.ingestion.converters.ProbeValidation”)

RE: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread David Newberger
Could you be looking at 2 jobs trying to use the same file and one getting to 
it before the other and finally removing it?

David Newberger

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, June 8, 2016 1:33 PM
To: user; user @spark
Subject: Creating a Hive table through Spark and potential locking issue (a bug)


Hi,

I noticed an issue with Spark creating and populating a Hive table.

The process as I see is as follows:


  1.  Spark creates the Hive table. In this case an ORC table in a Hive Database
  2.  Spark uses JDBC connection to get data out from an Oracle
  3.  I create a temp table in Spark through (registerTempTable)
  4.  Spark populates that table. That table is actually created in

   hdfs dfs -ls /tmp/hive/hduser
   drwx--   - hduser supergroup
   /tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059



  1.  However, The original table itself does not have any locking on it!
  2.  I log in into Hive and drop that table

3. hive> drop table dummy;
OK


  1.   That table is dropped OK
  2.  Spark crashes with message
Started at
[08/06/2016 18:37:53.53]
16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-1/_temporary/0/_temporary/attempt_201606081838_0001_m_00_0/part-0
 (inode 831621): File does not exist. Holder 
DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3169)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
16/06/08 19:13:46 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
aborting job

Suggested solution: in a concurrent environment, Spark should apply locks in
order to prevent such operations. Locks are kept in the Hive metadata table
HIVE_LOCKS.

HTH
Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi,

I noticed an issue with Spark creating and populating a Hive table.

The process as I see is as follows:


   1. Spark creates the Hive table. In this case an ORC table in a Hive
   Database
   2. Spark uses JDBC connection to get data out from an Oracle
   3. I create a temp table in Spark through (registerTempTable)
   4. Spark populates that table. That table is actually created in

   hdfs dfs -ls /tmp/hive/hduser
   drwx--   - hduser supergroup
   /tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059



   1. However, The original table itself does not have any locking on it!
   2. I log in into Hive and drop that table
   3.

   hive> drop table dummy;
   OK

   4.  That table is dropped OK
   5. Spark crashes with message

Started at
[08/06/2016 18:37:53.53]
16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-1/_temporary/0/_temporary/attempt_201606081838_0001_m_00_0/part-0
(inode 831621): File does not exist. Holder
DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3169)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
16/06/08 19:13:46 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
aborting job

Suggested solution: in a concurrent environment, Spark should apply locks in
order to prevent such operations. Locks are kept in the Hive metadata table
HIVE_LOCKS.

HTH
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread lalit sharma
To add to what Vikash said above, a bit more on the internals:
1. There are 2 components which work together to achieve the Hive + Spark
integration:
   a. HiveContext, which extends SQLContext and adds Hive-specific logic,
e.g. loading the jars needed to talk to the underlying metastore DB and
loading the configs in hive-site.xml.
   b. HiveThriftServer2, which uses the native HiveServer2 and adds logic for
creating sessions and handling operations.
2. Once the Thrift server is up, authentication and session management are all
delegated to Hive classes. Once parsing of the query is done, a logical plan
is created and passed on to create a DataFrame.

So there is no MapReduce; Spark uses the needed pieces from Hive and runs on
its own execution engine.
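A minimal sketch of the setup being described (the table name is a placeholder): the HiveContext only pulls metadata from the metastore, while the query below is planned and executed by Spark SQL itself.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-demo"))
val hiveContext = new HiveContext(sc)   // reads hive-site.xml and talks to the metastore

// Parsed and optimised by Catalyst/Tungsten, executed on Spark; no MapReduce job is launched.
hiveContext.sql("SELECT count(*) FROM some_db.some_table").show()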

--Regards,
Lalit

On Wed, Jun 8, 2016 at 9:59 PM, Vikash Pareek  wrote:

> Himanshu,
>
> Spark doesn't use hive execution engine (Map Reduce) to execute query.
> Spark
> only reads the meta data from hive meta store db and executes the query
> within Spark execution engine. This meta data is used by Spark's own SQL
> execution engine (this includes components such as catalyst, tungsten to
> optimize queries) to execute query and generate result faster than hive
> (Map
> Reduce).
>
> Using HiveContext means connecting to hive meta store db. Thus, HiveContext
> can access hive meta data, and hive meta data includes location of data,
> serialization and de-serializations, compression codecs, columns, datatypes
> etc. thus, Spark have enough information about the hive tables and it's
> data
> to understand the target data and execute the query over its on execution
> engine.
>
> Overall, Spark replaced the Map Reduce model completely by it's
> in-memory(RDD) computation engine.
>
> - Vikash Pareek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/When-queried-through-hiveContext-does-hive-executes-these-queries-using-its-execution-engine-default-tp27114p27117.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: Dealing with failures

2016-06-08 Thread Mohit Anchlia
On Wed, Jun 8, 2016 at 3:42 AM, Jacek Laskowski  wrote:

> On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia 
> wrote:
> > I am looking to write an ETL job using spark that reads data from the
> > source, perform transformation and insert it into the destination.
>
> Is this going to be one-time job or you want it to run every time interval?
>
> > 1. Source becomes slow or un-responsive. How to control such a situation
> so
> > that it doesn't cause DDoS on the source?
>
> Why do you think Spark would DDoS the source? I'm reading it as if
> Spark tried to open a new connection after the currently-open one
> became slow. I don't think it's how Spark does connections. What is
> the source in your use case?
>

>> I was primarily concerned about retry storms causing a DDoS on the
source. How does Spark deal with a scenario where it gets a timeout from the
source? Does it retry or does it fail? And if a task fails, does that fail
the job? And is it possible to restart the job and only process the failed
tasks and the remaining pending tasks? My use case is reading from
Cassandra, performing some transformation and saving the data to a different
Cassandra cluster. I want to make sure that the data is reliably copied
without missing data. At the same time I also want to make sure that the
process doesn't cause a performance impact on other live production traffic
to these clusters when there are failures, e.g. DDoS or retry storms.
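
For reference, a rough sketch of the knobs that bound retries and write
pressure in this kind of copy job. spark.task.maxFailures is core Spark; the
Cassandra connector throughput setting is taken from the DataStax connector
documentation and is listed here as an assumption to verify against the
connector version in use:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-copy")
  // per-task retry limit before the stage (and with it the job) is failed
  .set("spark.task.maxFailures", "4")
  // connector-specific cap on write throughput so a retry burst cannot
  // overwhelm the destination cluster (verify against the connector docs)
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")
val sc = new SparkContext(conf)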


> > Also, at the same time how to make it resilient that it does pick up
> from where it left?
>
> It sounds like checkpointing. It's available in Core and Streaming.
> So, what's your source and how often do you want to query for data?
> You may also benefit from the recent additions to Spark in 2.0 called
> Structured Streaming (aka Streaming Datasets) - see
> https://issues.apache.org/jira/browse/SPARK-8360.
>
>
>> Does checkpointing help with the failure scenario that I described
above? I read checkpointing as a way to restart processing of data if tasks
fail because of spark cluster issues. Does it also work in the scenario
that I described?


> > 2. In the same context when destination becomes slow or un-responsive.
>
> What is a destination? It appears as if you were doing streaming and
> want to use checkpointing and back-pressure. But you haven't said much
> about your use case to be precise.
>
> Jacek
>


Re: When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread Vikash Pareek
Himanshu,

Spark doesn't use the Hive execution engine (Map Reduce) to execute the query.
Spark only reads the metadata from the Hive metastore db and executes the
query within its own execution engine. This metadata is used by Spark's own
SQL execution engine (which includes components such as Catalyst and Tungsten
to optimize queries) to execute the query and generate the result faster than
Hive (Map Reduce).

Using HiveContext means connecting to the Hive metastore db. Thus, HiveContext
can access the Hive metadata, and that metadata includes the location of the
data, serializers and deserializers, compression codecs, columns, datatypes
etc. Thus, Spark has enough information about the Hive tables and their data
to understand the target data and execute the query over its own execution
engine.

Overall, Spark replaced the Map Reduce model completely with its
in-memory (RDD) computation engine.

- Vikash Pareek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-queried-through-hiveContext-does-hive-executes-these-queries-using-its-execution-engine-default-tp27114p27117.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ChiSqSelector Selected Features Indices

2016-06-08 Thread Sebastian Kuepers
Hi there,


what is the best way to get from:

pyspark.mllib.feature.ChiSqSelector(numTopFeatures)


the indices of the selected features within the original input vector?


Shouldn't the model contain this information?


Thanks!
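
On the Scala side, the fitted mllib model exposes exactly those indices via
its selectedFeatures field; a minimal sketch (data is assumed to be an
existing RDD[LabeledPoint] holding the original feature vectors):

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint] is assumed to be defined elsewhere
def topFeatureIndices(data: RDD[LabeledPoint], numTopFeatures: Int): Array[Int] = {
  val model = new ChiSqSelector(numTopFeatures).fit(data)
  // Indices into the original input vector of the features the selector kept
  model.selectedFeatures
}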





Disclaimer The information in this email and any attachments may contain 
proprietary and confidential information that is intended for the addressee(s) 
only. If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution, retention or use of the contents of this 
information is prohibited. When addressed to our clients or vendors, any 
information contained in this e-mail or any attachments is subject to the terms 
and conditions in any governing contract. If you have received this e-mail in 
error, please immediately contact the sender and delete the e-mail.


Re: Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-06-08 Thread philippe v
here is a gist with the minimal code and data

http://gist.github.com/anonymous/aca8ba5841404ea092f9efcc658c5d57





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trainning-a-spark-ml-linear-regresion-model-fail-after-migrating-from-1-5-2-to-1-6-1-tp27111p27116.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable 
to load AWS credentials from any provider in the chain)
 
I tried setting the s3a keys using the configuration object but I might be 
hitting SPARK-11364 :
conf.set("fs.s3a.access.key", accessKey)
conf.set("fs.s3a.secret.key", secretKey)
conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
val sc = new SparkContext(conf)
 
I tried setting these properties in hdfs-site.xml but I'm still getting this
error.
Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY environment 
variables but with no luck.
 
Any ideas on how to resolve this issue ?
 
Thank you.
Daniel

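
One workaround that is sometimes suggested for SPARK-11364-style issues is to
set the keys on the SparkContext's Hadoop configuration before creating the
HiveContext. A sketch only (accessKey/secretKey and the bucket path are
placeholders, and the metastore side may still need the keys in
core-site.xml/hive-site.xml):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("s3a-hive"))
// accessKey and secretKey are assumed to be defined elsewhere;
// set them on the Hadoop configuration actually used by this context
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

val hiveContext = new HiveContext(sc)
// my-bucket/path is a placeholder location
hiveContext.sql("CREATE EXTERNAL TABLE t (id INT) LOCATION 's3a://my-bucket/path/'")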

Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-08 Thread Jacek Laskowski
Hi,

I just noticed it today while toying with Spark 2.0.0 (today's build)
that doing Seq(...).toDF does **not** submit a Spark job while
sc.parallelize(Seq(...)).toDF does. I was nicely surprised and have been
thinking about the reason for the behaviour.

My explanation was that Datasets are just a "view" layer atop data and
when this data is local/in memory already there's no need to submit a
job to...well...compute the data.

I'd appreciate more in-depth answer, perhaps with links to the code. Thanks!
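
A quick way to see the difference in the shell (a sketch only; the exact plan
names may differ between builds):

// Spark 2.0 spark-shell, implicits already in scope
val localDF = Seq(1, 2, 3).toDF("n")
localDF.explain()   // planned as a local relation (LocalTableScan) -- no cluster work needed

val rddDF = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("n")
rddDF.explain()     // planned as a scan over an existing RDD, which runs on the cluster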

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: comparing rows in a pyspark data frame

2016-06-08 Thread Jacek Laskowski
On Wed, Jun 8, 2016 at 2:05 PM, pseudo oduesp  wrote:

> how can we compare columns to get the max of a row (not of a column) and get
> the name of the column where that max is present?

First thought - a UDF.
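
A sketch of that idea in Scala (the same approach carries over to PySpark);
df and the columns col1/col2/col3 are the hypothetical ones from the question:

import org.apache.spark.sql.functions.{col, udf}

val names = Seq("col1", "col2", "col3")

// Row-wise max and the name of the column that holds it
val rowMax    = udf((a: Double, b: Double, c: Double) => Seq(a, b, c).max)
val rowArgmax = udf((a: Double, b: Double, c: Double) =>
  names(Seq(a, b, c).zipWithIndex.maxBy(_._1)._2))

// df is assumed to be an existing DataFrame with those three numeric columns
val result = df
  .withColumn("row_max", rowMax(col("col1"), col("col2"), col("col3")))
  .withColumn("max_col", rowArgmax(col("col1"), col("col2"), col("col3")))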

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread Himanshu Mehra
So what happens underneath when we query a Hive table using hiveContext?

1. Does Spark talk to the metastore to get the data location on HDFS and read
the data from there to perform those queries?
2. Or does Spark pass those queries to Hive, which executes them on the
table and returns the results to Spark? In this case, might Hive be using
map-reduce to execute the queries?

Please clarify this confusion. I have looked into the code and it seems like
Spark is just fetching the data from HDFS. Please convince me otherwise.

Thanks

Best



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-queried-through-hiveContext-does-hive-executes-these-queries-using-its-execution-engine-default-tp27114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: GraphX Java API

2016-06-08 Thread Felix Cheung
You might want to check out GraphFrames
graphframes.github.io





On Sun, Jun 5, 2016 at 6:40 PM -0700, "Santoshakhilesh" 
 wrote:





Ok, thanks for letting me know. Yes, since Java and Scala programs ultimately
run on the JVM, the APIs written in one language can be called from the other.
When I had used GraphX (around the beginning of 2015) the native Java APIs
were not available for GraphX.
So I chose to develop my application in Scala, and it turned out much simpler
to develop in Scala due to some of its powerful features like lambdas, map,
filter etc., which were not available to me in Java 7.
Regards,
Santosh Akhilesh

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: 01 June 2016 00:56
To: Santoshakhilesh
Cc: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org; Golatkar, Jayesh 
(US - Bengaluru); Soni, Akhil Dharamprakash (US - Bengaluru); Matta, Rishul (US 
- Bengaluru); Aich, Risha (US - Bengaluru); Kumar, Rajinish (US - Bengaluru); 
Jain, Isha (US - Bengaluru); Kumar, Sandeep (US - Bengaluru)
Subject: Re: GraphX Java API

It's very much possible to use GraphX through Java, though some boilerplate may
be needed. Here is an example.

Create a graph from vertex and edge RDDs (JavaRDD<Tuple2<Object, Long>>
vertices, JavaRDD<Edge<Long>> edges):


ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
Graph<Long, Long> graph = Graph.apply(vertices.rdd(),
        edges.rdd(), 0L,
        StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
        longTag, longTag);



Then basically you can call graph.ops() and do available operations like 
triangleCounting etc,

Best Regards,
Sonal
Founder, Nube Technologies
Reifier at Strata Hadoop World
Reifier at Spark Summit 
2015




On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh 
> wrote:
Hi ,
Scala has a similar package structure to Java and ultimately runs on the JVM,
so you probably get the impression that it's in Java.
As far as I know there is no Java API for GraphX. I used GraphX last year
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) 
[mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
•   I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx
•   org.apache.spark.graphx.impl
•   org.apache.spark.graphx.lib
•   org.apache.spark.graphx.util
Aren’t they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) 
>; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX you need to
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar






This message (including any attachments) contains confidential information 
intended for a specific individual and purpose, and is protected by law. If you 
are not the intended recipient, you should delete this message and any 
disclosure, copying, or distribution of this message, or the taking of any 
action based on it, by you is strictly prohibited.

v.E.1










Re: SQL JSON array operations

2016-06-08 Thread amalik
Hi @jvuillermet, 
I am encountering a similar problem. Did you manage to figure out parsing of
complicated unstructured json files. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-JSON-array-operations-tp21164p27113.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
You can directly load it into solr.
But think about what you want to index etc.


> On 08 Jun 2016, at 15:51, Mich Talebzadeh  wrote:
> 
> Yes, using that approach is reasonable.
> 
> What is the format of the Twitter data? Is it primarily JSON?
> 
> If I do
> 
> duser@rhes564: /usr/lib/nifi-0.6.1/conf> hdfs dfs -cat 
> /twitter_data/FlumeData.1464945101915|more
> 
> 16/06/08 14:48:36 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> {"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","
> type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":[
> "string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_
> reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_ur
> l","type":["string","null"]}]}
> ▒ぷろが説明書✋http://twpf.jp/960_krm
> 8659292711026688
> ▒便座カバー
> 男児「こちらロストボーイ1、"施設"に侵入した。セキュリティが厄介で数日かかるな。偽装工作はうまくいっているか?」
> 父親「あぁ今じゃ立派なダメ親父と可哀想な子供扱いだ。これで万一見つかってもお前は安心さ」
> 男…
> ter.com" rel="nofollow">Twitter Web Clients, and fun!
> Learning a new la... https://t.co/ejHfRcAucy
> 3-2 7番 代議員
> 木村亮太
> 知ってる人はRT
> ちびまる子ちゃんのEDですね!
> この曲は12年前になります。 https://t.co/LfLZ8xX5u9
> twitter.com/hijidora/status/737980634858029056/video/1$738659292677431296
> (ライトニングさんティナラムザアダマンA)/かんこれ/EXVS(モチベ↑雑魚後衛)/FGOその他適当
> ▒http://twitter.com/download/iphone; rel="nofollow">Twitter for 
> iPhone
> itter.com/download/android" rel="nofollow">Twitter for Android Free Lyft 
> credit with Lyft promo code LYFTLUSHpp.com" rel="nofollow">Buffer
> naliar for iPad
> third person
> 男子南ことりが大好きなラブライバーです! ラブライブ大好きな人ぜひフォローしてください
> 固定ツイートお願いします
> ラブライブに出会えて良かった!
> 9人のみんなのこと忘れない
> #LoveLiveforever
> #ラブライバーと繋がりたいRT https://t.co/kITPDLER9x
> 07114803986434/photo/1$738659292685979648
> :13Z://pbs.twimg.com/media/CkA-exTWYAAK8TU.jpg
> : 1000RT:【資金不足】「学園ハンサム」、クラウドファンディングでアニメ化支援を募集
> https://t.co/CVM2F7rNt1
> 放送局やキャストは「支援額に応じて変わる」とのこと。時期は10月から1クールと発表されている。 http…
> com/media/CkAftVyUYAA0nmn.jpg-06-03T10:11:13Z
> miga, sutiã que é do dia a dia ela só usa de ser obrigada, acha mesmo q ela 
> compraria mais d…r Promoter | Worked with @inkmonstarr @breadboi @ayookd 
> @chapobandz and more | PayPal accepted | DM for beats | Beats Starting at $10 
> |resenting August Redmoon at the Hollywood premiere of Inside Metal: The 
> Metal Scene Explodes! 落落落落 https://t.…
> .jpg
> 
> -03T10:11:13Z
>  
> I assume it is all json data. So I can use solr to build index on these files 
> and do a search?
> 
> Or alternatively use it for a staging area for Hive table?
> 
> 
> thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 8 June 2016 at 14:01, Jörn Franke  wrote:
>> I mean what you should also look at is ingestion capacity. If you have a 
>> lots of irregular writes such as from sensor data, it can make sense to 
>> store them first in hbase and flush them regularly to Orc/parquet hive 
>> tables for analysis 
>> 
>>> On 08 Jun 2016, at 13:15, Mich Talebzadeh  wrote:
>>> 
>> 
>>> Interesting. There is also apache nifi
>>> 
>>> Also I note that one can store twitter data in Hive tables as well?
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 7 June 2016 at 15:59, Mich Talebzadeh  wrote:
 thanks I will have a look.
 
 Mich
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
  
 
> On 7 June 2016 at 13:38, Jörn Franke  wrote:
> Solr is basically an in-memory text index with a lot of capabilities for 
> language analysis extraction (you can compare  it to a Google for your 
> tweets). The system itself has a lot of features and has a complexity 
> similar to Big data systems. This index files can be backed by HDFS. You 
> can put the tweets directly into solr without going via HDFS files.
> 
> Carefully decide what fields to index / you want to search. It does not 
> make sense to index 

Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Yes, using that approach is reasonable.

What is the format of the Twitter data? Is it primarily JSON?

If I do

*duser@rhes564: /usr/lib/nifi-0.6.1/conf> hdfs dfs -cat
/twitter_data/FlumeData.1464945101915|more*

16/06/08 14:48:36 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","
type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":[
"string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_
reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_ur
l","type":["string","null"]}]}
▒ぷろが説明書✋http://twpf.jp/960_krm
8659292711026688
▒便座カバー
男児「こちらロストボーイ1、"施設"に侵入した。セキュリティが厄介で数日かかるな。偽装工作はうまくいっているか?」
父親「あぁ今じゃ立派なダメ親父と可哀想な子供扱いだ。これで万一見つかってもお前は安心さ」
男…
ter.com" rel="nofollow">Twitter Web Clients, and fun!
Learning a new la... https://t.co/ejHfRcAucy
3-2 7番 代議員
木村亮太
知ってる人はRT
ちびまる子ちゃんのEDですね!
この曲は12年前になります。 https://t.co/LfLZ8xX5u9
twitter.com/hijidora/status/737980634858029056/video/1$738659292677431296
(ライトニングさんティナラムザアダマンA)/かんこれ/EXVS(モチベ↑雑魚後衛)/FGOその他適当
▒http://twitter.com/download/iphone; rel="nofollow">Twitter for
iPhone
itter.com/download/android" rel="nofollow">Twitter for Android Free
Lyft credit with Lyft promo code LYFTLUSHpp.com" rel="nofollow">Buffer
naliar for iPad
third person
男子南ことりが大好きなラブライバーです! ラブライブ大好きな人ぜひフォローしてください
固定ツイートお願いします
ラブライブに出会えて良かった!
9人のみんなのこと忘れない
#LoveLiveforever
#ラブライバーと繋がりたいRT https://t.co/kITPDLER9x
07114803986434/photo/1$738659292685979648
:13Z://pbs.twimg.com/media/CkA-exTWYAAK8TU.jpg
: 1000RT:【資金不足】「学園ハンサム」、クラウドファンディングでアニメ化支援を募集
https://t.co/CVM2F7rNt1
放送局やキャストは「支援額に応じて変わる」とのこと。時期は10月から1クールと発表されている。 http…
com/media/CkAftVyUYAA0nmn.jpg-06-03T10:11:13Z
miga, sutiã que é do dia a dia ela só usa de ser obrigada, acha mesmo q ela
compraria mais d…r Promoter | Worked with @inkmonstarr @breadboi @ayookd
@chapobandz and more | PayPal accepted | DM for beats | Beats Starting at
$10 |resenting August Redmoon at the Hollywood premiere of Inside Metal:
The Metal Scene Explodes! 落落落落 https://t.…
.jpg

-03T10:11:13Z

I assume it is all JSON data. So can I use Solr to build an index on these
files and do a search?

Or alternatively use it as a staging area for a Hive table?
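
If the payload really is JSON (one document per line), a minimal sketch of
staging it for SQL queries, assuming the Avro wrapper that Flume writes has
already been decoded to plain JSON files under a hypothetical
/twitter_data/json/ directory:

// /twitter_data/json/ is a placeholder path for the decoded JSON files;
// user_screen_name and text come from the schema shown above
val tweets = sqlContext.read.json("hdfs:///twitter_data/json/*")
tweets.printSchema()
tweets.registerTempTable("tweets")
sqlContext.sql(
  "SELECT user_screen_name, text FROM tweets WHERE text LIKE '%organic food%'"
).show()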


thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 14:01, Jörn Franke  wrote:

> I mean what you should also look at is ingestion capacity. If you have a
> lots of irregular writes such as from sensor data, it can make sense to
> store them first in hbase and flush them regularly to Orc/parquet hive
> tables for analysis
>
> On 08 Jun 2016, at 13:15, Mich Talebzadeh 
> wrote:
>
> Interesting. There is also apache nifi 
>
> Also I note that one can store twitter data in Hive tables as well?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 15:59, Mich Talebzadeh 
> wrote:
>
>> thanks I will have a look.
>>
>> Mich
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 13:38, Jörn Franke  wrote:
>>
>>> Solr is basically an in-memory text index with a lot of capabilities for
>>> language analysis extraction (you can compare  it to a Google for your
>>> tweets). The system itself has a lot of features and has a complexity
>>> similar to Big data systems. This index files can be backed by HDFS. You
>>> can put the tweets directly into solr without going via HDFS files.
>>>
>>> Carefully decide what fields to index / you want to search. It does not
>>> make sense to index everything.
>>>
>>> On 07 Jun 2016, at 13:51, Mich Talebzadeh 
>>> wrote:
>>>
>>> Ok So basically 

Re: comparing rows in a pyspark data frame

2016-06-08 Thread Ted Yu
Do you mean returning col3 and 0.4 for the example row below ?

> On Jun 8, 2016, at 5:05 AM, pseudo oduesp  wrote:
> 
> Hi,
> how can we compare multiple columns in a dataframe? I mean,
> 
> if df is a dataframe like this:
> 
>    df.col1 | df.col2 | df.col3 
>    0.2     | 0.3     | 0.4 
> 
> how can we compare columns to get the max of a row (not of a column) and get
> the name of the column where that max is present?
> 
> thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
That is trivial to do , I did it once when they were in json format

> On 08 Jun 2016, at 13:15, Mich Talebzadeh  wrote:
> 
> Interesting. There is also apache nifi
> 
> Also I note that one can store twitter data in Hive tables as well?
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 7 June 2016 at 15:59, Mich Talebzadeh  wrote:
>> thanks I will have a look.
>> 
>> Mich
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 7 June 2016 at 13:38, Jörn Franke  wrote:
>>> Solr is basically an in-memory text index with a lot of capabilities for 
>>> language analysis extraction (you can compare  it to a Google for your 
>>> tweets). The system itself has a lot of features and has a complexity 
>>> similar to Big data systems. This index files can be backed by HDFS. You 
>>> can put the tweets directly into solr without going via HDFS files.
>>> 
>>> Carefully decide what fields to index / you want to search. It does not 
>>> make sense to index everything.
>>> 
 On 07 Jun 2016, at 13:51, Mich Talebzadeh  
 wrote:
 
 Ok So basically for predictive off-line (as opposed to streaming) in a 
 nutshell one can use Apache Flume to store twitter data in hdfs and use 
 Solr to query the data?
 
 This is what it says:
 
 Solr is a standalone enterprise search server with a REST-like API. You 
 put documents in it (called "indexing") via JSON, XML, CSV or binary over 
 HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary 
 results.
 
 thanks
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
  
 
> On 7 June 2016 at 12:39, Jörn Franke  wrote:
> Well I have seen that The algorithms mentioned are used for this. However 
> some preprocessing through solr makes sense - it takes care of synonyms, 
> homonyms, stemming etc
> 
>> On 07 Jun 2016, at 13:33, Mich Talebzadeh  
>> wrote:
>> 
>> Thanks Jorn,
>> 
>> To start I would like to explore how can one turn some of the data into 
>> useful information.
>> 
>> I would like to look at certain trend analysis. Simple correlation shows 
>> that the more there is a mention of a typical topic say for example 
>> "organic food" the more people are inclined to go for it. To see one can 
>> deduce that orgaind food is a potential growth area.
>> 
>> Now I have all infra-structure to ingest that data. Like using flume to 
>> store it or Spark streaming to do near real time work.
>> 
>> Now I want to slice and dice that data for say organic food.
>> 
>> I presume this is a typical question.
>> 
>> You mentioned Spark ml (machine learning?) . Is that something viable?
>> 
>> Cheers
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 7 June 2016 at 12:22, Jörn Franke  wrote:
>>> Spark ml Support Vector machines or neural networks could be 
>>> candidates. 
>>> For unstructured learning it could be clustering.
>>> For doing a graph analysis On the followers you can easily use Spark 
>>> Graphx
>>> Keep in mind that each tweet contains a lot of meta data (location, 
>>> followers etc) that is more or less structured.
>>> For unstructured text analytics (eg tweet itself)I recommend 
>>> solr/ElasticSearch .
>>> 
>>> However I am not sure what you want to do with the data exactly.
>>> 
>>> 
 On 07 Jun 2016, at 13:16, Mich Talebzadeh  
 wrote:
 
 Hi,
 
 This is really a general question.
 
 I use Spark to get twitter data. I did some looking at it
 
 val ssc = new StreamingContext(sparkConf, Seconds(2))
 val tweets = TwitterUtils.createStream(ssc, None)
 val statuses = tweets.map(status => status.getText())
 statuses.print()
 
 Ok
 
 Also I can use Apache flume to store data in hdfs directory
 
 $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf 
 Dflume.root.logger=DEBUG,console -n TwitterAgent
 Now that stores twitter data in binary format in  hdfs 

Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Interesting. There is also apache nifi 

Also I note that one can store twitter data in Hive tables as well?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 15:59, Mich Talebzadeh  wrote:

> thanks I will have a look.
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 13:38, Jörn Franke  wrote:
>
>> Solr is basically an in-memory text index with a lot of capabilities for
>> language analysis extraction (you can compare  it to a Google for your
>> tweets). The system itself has a lot of features and has a complexity
>> similar to Big data systems. This index files can be backed by HDFS. You
>> can put the tweets directly into solr without going via HDFS files.
>>
>> Carefully decide what fields to index / you want to search. It does not
>> make sense to index everything.
>>
>> On 07 Jun 2016, at 13:51, Mich Talebzadeh 
>> wrote:
>>
>> Ok So basically for predictive off-line (as opposed to streaming) in a
>> nutshell one can use Apache Flume to store twitter data in hdfs and use
>> Solr to query the data?
>>
>> This is what it says:
>>
>> Solr is a standalone enterprise search server with a REST-like API. You
>> put documents in it (called "indexing") via JSON, XML, CSV or binary over
>> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary
>> results.
>>
>> thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 12:39, Jörn Franke  wrote:
>>
>>> Well I have seen that The algorithms mentioned are used for this.
>>> However some preprocessing through solr makes sense - it takes care of
>>> synonyms, homonyms, stemming etc
>>>
>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks Jorn,
>>>
>>> To start I would like to explore how can one turn some of the data into
>>> useful information.
>>>
>>> I would like to look at certain trend analysis. Simple correlation shows
>>> that the more there is a mention of a typical topic say for example
>>> "organic food" the more people are inclined to go for it. To see one can
>>> deduce that orgaind food is a potential growth area.
>>>
>>> Now I have all infra-structure to ingest that data. Like using flume to
>>> store it or Spark streaming to do near real time work.
>>>
>>> Now I want to slice and dice that data for say organic food.
>>>
>>> I presume this is a typical question.
>>>
>>> You mentioned Spark ml (machine learning?) . Is that something viable?
>>>
>>> Cheers
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 7 June 2016 at 12:22, Jörn Franke  wrote:
>>>
 Spark ml Support Vector machines or neural networks could be
 candidates.
 For unstructured learning it could be clustering.
 For doing a graph analysis On the followers you can easily use Spark
 Graphx
 Keep in mind that each tweet contains a lot of meta data (location,
 followers etc) that is more or less structured.
 For unstructured text analytics (eg tweet itself)I recommend
 solr/ElasticSearch .

 However I am not sure what you want to do with the data exactly.


 On 07 Jun 2016, at 13:16, Mich Talebzadeh 
 wrote:

 Hi,

 This is really a general question.

 I use Spark to get twitter data. I did some looking at it

 val ssc = new StreamingContext(sparkConf, Seconds(2))
 val tweets = TwitterUtils.createStream(ssc, None)
 val statuses = tweets.map(status => status.getText())
 statuses.print()

 Ok

 Also I can use Apache flume to store data in hdfs directory

 $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
 Dflume.root.logger=DEBUG,console -n TwitterAgent
 Now that stores twitter data in binary format in  hdfs directory.

 My question is pretty basic.

 What is the best tool/language to dif in to that data. For example
 twitter 

Re: oozie and spark on yarn

2016-06-08 Thread vaquar khan
Hi Karthi,

Hope the following information will help you.

Doc:
https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html

Example :
https://developer.ibm.com/hadoop/2015/11/05/run-spark-job-yarn-oozie/

Code :

http://3097fca9b1ec8942c4305e550ef1b50a.proxysheep.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd

regards,

Vaquar Khan



On Wed, Jun 8, 2016 at 5:26 AM, karthi keyan 
wrote:

> Hi ,
>
> Make sure you have Oozie 4.2.0 and that it is configured with either yarn or
> mesos mode.
>
> Well, you just pass your Scala / JAR file in the syntax below,
>
> <spark xmlns="uri:oozie:spark-action:0.1">
>     <job-tracker>${jobTracker}</job-tracker>
>     <name-node>${nameNode}</name-node>
>     <master>${master}</master>
>     <name>Wordcount</name>
>     <class>${Classname}</class>
>     <jar>${nameNode}/WordCount.jar</jar>
> </spark>
>
> The above sample expects the required file to be in HDFS; change it according
> to your use case.
>
> Regards,
> Karthik
>
>
> On Wed, Jun 8, 2016 at 1:10 PM, pseudo oduesp 
> wrote:
>
>> hi,
>>
>> I want to ask if someone has used Oozie with Spark?
>>
>> Could you give me an example:
>> how can we configure it on YARN?
>> thanks
>>
>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: Dealing with failures

2016-06-08 Thread Jacek Laskowski
On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia  wrote:
> I am looking to write an ETL job using spark that reads data from the
> source, perform transformation and insert it into the destination.

Is this going to be a one-time job or do you want it to run at a regular interval?

> 1. Source becomes slow or un-responsive. How to control such a situation so
> that it doesn't cause DDoS on the source?

Why do you think Spark would DDoS the source? I'm reading it as if
Spark tried to open a new connection after the currently-open one
became slow. I don't think it's how Spark does connections. What is
the source in your use case?

> Also, at the same time how to make it resilient that it does pick up from 
> where it left?

It sounds like checkpointing. It's available in Core and Streaming.
So, what's your source and how often do you want to query for data?
You may also benefit from the recent additions to Spark in 2.0 called
Structured Streaming (aka Streaming Datasets) - see
https://issues.apache.org/jira/browse/SPARK-8360.
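
For the streaming case, checkpoint-based recovery usually looks roughly like
this (a sketch only; checkpointDir and createContext are placeholder names):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder: builds the full streaming job and enables checkpointing
def createContext(checkpointDir: String): StreamingContext = {
  val conf = new SparkConf().setAppName("etl-stream")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // ... define sources, transformations and sinks here ...
  ssc
}

val checkpointDir = "hdfs:///checkpoints/etl-stream" // placeholder path
// On restart, state and pending batches are recovered from the checkpoint directory
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()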

> 2. In the same context when destination becomes slow or un-responsive.

What is a destination? It appears as if you were doing streaming and
want to use checkpointing and back-pressure. But you haven't said much
about your use case to be precise.

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



OneVsRest SVM - Very Low F-Measure compared to OneVsRest Logistic Regression

2016-06-08 Thread Hayri Volkan Agun
Hi,

I built a transformer model for Spark SVM for binary classification. I
basically implemented the predictRaw method of the classifier and
classification model in the Spark ML API.

override def predictRaw(dataMatrix: Vector):Vector = {
  val m = weights.toBreeze.dot(dataMatrix.toBreeze) + intercept
  Vectors.dense(-m, m)
}


I have an imbalanced text dataset. The scores of logistic regression and
naive bayes for the bag-of-words model are very high for author classification
with OneVsRest settings, but the scores of SVM are very low. I am using the
standard parameters of SVM with a maximum of 3000 iterations in OneVsRest.
What might be the problem? I am using the same features (200125), labels
(9), ~1500 training instances, ~500 test instances and OneVsRest for all
the compared settings.


Thanks in advance...
Hayri Volkan Agun
PhD. Student - Anadolu University


Re: Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-06-08 Thread Jacek Laskowski
Hi,

Is it only me who can *not* see the snippets? Could you please gist 'em =>
https://gist.github.com ?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jun 8, 2016 at 10:22 AM, philippe v  wrote:
> I use spark-ml to train a linear regression model. It worked perfectly with
> spark version 1.5.2 but now with 1.6.1 I get the following error :
>
>
>
> Here is a minimal code :
>
>
>
> And input.csv data
>
>
>
> the pom.xml
>
>
>
>
> How can I fix it ?
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Trainning-a-spark-ml-linear-regresion-model-fail-after-migrating-from-1-5-2-to-1-6-1-tp27111.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



unsubscribe

2016-06-08 Thread Amal Babu



Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-06-08 Thread philippe v
I use spark-ml to train a linear regression model. It worked perfectly with
spark version 1.5.2 but now with 1.6.1 I get the following error :



Here is a minimal code : 



And input.csv data



the pom.xml




How can I fix it ? 






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trainning-a-spark-ml-linear-regresion-model-fail-after-migrating-from-1-5-2-to-1-6-1-tp27111.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 2.0 Release Date

2016-06-08 Thread Jacek Laskowski
Whoohoo! What great news! Looks like an RC is coming... Thanks a lot, Reynold!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jun 8, 2016 at 7:55 AM, Reynold Xin  wrote:
> It'd be great to cut an RC as soon as possible. Looking at the
> blocker/critical issue list, majority of them are API audits. I think people
> will get back to those once Spark Summit is over, and then we should see
> some good progress towards an RC.
>
> On Tue, Jun 7, 2016 at 6:20 AM, Jacek Laskowski  wrote:
>>
>> Finally, the PMC voice on the subject. Thanks a lot, Sean!
>>
>> p.s. Given how much time it takes to ship 2.0 (with so many cool
>> features already backed in!) I'd vote for releasing a few more RCs
>> before 2.0 hits the shelves. I hope 2.0 is not Java 9 or Jigsaw ;-)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Jun 7, 2016 at 3:06 PM, Sean Owen  wrote:
>> > I don't believe the intent was to get it out before Spark Summit or
>> > something. That shouldn't drive the schedule anyway. But now that
>> > there's a 2.0.0 preview available, people who are eager to experiment
>> > or test on it can do so now.
>> >
>> > That probably reduces urgency to get it out the door in order to
>> > deliver new functionality. I guessed the 2.0.0 release would be mid
>> > June and now I'd guess early July. But, nobody's discussed it per se.
>> >
>> > In theory only fixes, tests and docs are being merged, so the JIRA
>> > count should be going down. It has, slowly. Right now there are 72
>> > open issues for 2.0.0, of which 20 are blockers. Most of those are
>> > simple "audit x" or "document x" tasks or umbrellas, but, they do
>> > represent things that have to get done before a release, and that to
>> > me looks like a few more weeks of finishing, pushing, and sweeping
>> > under carpets.
>> >
>> >
>> > On Tue, Jun 7, 2016 at 1:45 PM, Jacek Laskowski  wrote:
>> >> On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel 
>> >> wrote:
>> >>> Do we have any further updates on release date?
>> >>
>> >> Nope :( And it's even more quiet than I could have thought. I was so
>> >> certain that today's the date. Looks like Spark Summit has "consumed"
>> >> all the people behind 2.0...Can't believe no one (from the
>> >> PMC/committers) even mean to shoot a date :( Patrick's gone. Reynold's
>> >> busy. Perhaps Sean?
>> >>
>> >>> Also, Is there a updated documentation for 2.0 somewhere?
>> >>
>> >>
>> >> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/
>> >>  ?
>> >>
>> >> Jacek
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



oozie and spark on yarn

2016-06-08 Thread pseudo oduesp
hi,

I want to ask if someone has used Oozie with Spark?

Could you give me an example:
how can we configure it on YARN?
thanks


Spark streaming micro batch failure handling

2016-06-08 Thread aviemzur
Hi,

A question about Spark Streaming's handling of a failed micro batch.

After a certain number of task failures, there are no more retries, and the
entire batch fails.
What seems to happen next is that this batch is ignored and the next micro
batch begins, which means not all the data has been processed.

Is there a way to configure the spark streaming application to not continue
to the next batch, but rather stop the streaming context upon a micro batch
failure (After all task retries have been exhausted)?
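
I am not aware of a single configuration flag for this, but one possible
approach (a sketch only; ssc is the StreamingContext assumed to be in scope,
and the listener field names are worth double-checking against your Spark
version) is to watch completed batches for failed output operations and stop
the context yourself:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// ssc: StreamingContext is assumed to be defined elsewhere
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // Treat the batch as failed if any of its output operations reported a failure
    val failed = batch.batchInfo.outputOperationInfos.values.exists(_.failureReason.isDefined)
    if (failed) {
      // Stop from a separate thread so the listener bus is not blocked
      new Thread("stop-on-batch-failure") {
        override def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = false)
      }.start()
    }
  }
})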



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-micro-batch-failure-handling-tp27110.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org