Weird experience Hive with Spark Transformations

2017-01-16 Thread Chetan Khatri
Hello,

I have the following services configured and installed successfully:

Hadoop 2.7.x
Spark 2.0.x
HBase 1.2.4
Hive 1.2.1

*Installation Directories:*

/usr/local/hadoop
/usr/local/spark
/usr/local/hbase

*Hive Environment variables:*

#HIVE VARIABLES START
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
#HIVE VARIABLES END

So I can access Hive from anywhere, as the environment variables are
configured. Now, if I start my spark-shell and Hive from the location
/usr/local/hive, both work fine against the Hive metastore; otherwise Spark
creates its own metastore in whatever directory I start spark-shell from.

That is, I am reading from HBase and writing to Hive using Spark, and I don't
know why this weird issue happens.
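
(Editorial aside, not part of the original message: a minimal sketch of the usual fix, under the assumption that hive-site.xml lives in /usr/local/hive/conf and the warehouse is at /usr/local/hive/warehouse. Spark only finds the shared Hive metastore when hive-site.xml is on its classpath, for example copied or symlinked into /usr/local/spark/conf; otherwise each spark-shell falls back to a local Derby metastore_db created in whatever directory it was launched from.)

    import org.apache.spark.sql.SparkSession

    // Build a Hive-enabled session explicitly instead of relying on the launch directory.
    val spark = SparkSession.builder()
      .appName("HBase to Hive")
      .config("spark.sql.warehouse.dir", "/usr/local/hive/warehouse")  // assumed warehouse path
      .enableHiveSupport()  // use the Hive metastore instead of a local Derby one
      .getOrCreate()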




Thanks.


Re: About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Spark Folks,

Another weird experience I have with Spark and SQLContext: when I create a
DataFrame, sometimes this error/exception is thrown and sometimes not!

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val stdDf = sqlContext.createDataFrame(rowRDD,empSchema.struct);
17/01/17 10:27:15 ERROR metastore.RetryingHMSHandler:
AlreadyExistsException(message:Database default already exists)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy21.create_database(Unknown Source)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy22.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:309)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:309)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:309)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
at
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
at
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
at
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269)
at
org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:308)
at
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:99)
at
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
at
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
at
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
at
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:98)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:89)
at
org.apache.spark.sql.hive.HiveSessionCatalog.(HiveSessionCatalog.scala:51)
at
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:49)
at
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at
org.apache.spark.sql.hive.HiveSessionState$$anon$1.(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:542)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:302)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:337)
at
$line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:43)
at
$line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:48)
at
$line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:50)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:52)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:54)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:56)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw.(:58)
at $line28.$read$$iw$$iw$$iw$$iw$$iw.(:60)
at $line28.$read$$iw$$iw$$iw$$iw.(:62)
at $line28.$read$$iw$$iw$$iw.(:64)
at $line28.$read$$iw$$iw.(:66)
at $line28.$read$$iw.(:68)
at $line28.$read.(:70)
at $line28.$read$.(:74)
at $line28.$read$.()
at $line28.$eval$.$print$lzycompute(:7)
at $line28.$eval$.$print(:6)
at $line28.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
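
(Editorial aside, not part of the original message: the AlreadyExistsException above is typically just logged by the Hive metastore when Spark tries to (re)create the default database. In Spark 2.x the usual pattern is to create DataFrames through the single Hive-enabled SparkSession rather than a separately constructed SQLContext, so only one Hive client initialises the metastore. A minimal sketch, reusing the rowRDD and empSchema names from this thread and assuming a Hive-enabled session named spark already exists:)

    // Create the DataFrame through the existing Hive-enabled SparkSession.
    val stdDf = spark.createDataFrame(rowRDD, empSchema.struct)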

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
Usually this kind of thing can be done at a lower level in the InputFormat
usually by specifying the max split size. Have you looked into that
possibility with your InputFormat?
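
(Editorial aside, not part of the original message: an illustrative sketch of what "specifying the max split size" typically looks like for a Hadoop InputFormat read in Spark. The input path and the 64 MB cap are assumptions for the example.)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Cap the split size so the InputFormat itself produces more, smaller splits.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",        // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      hadoopConf)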

On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu  wrote:

> Hi Jasbir,
>
> Yes, you are right. Do you have any idea about my question?
>
> Thanks,
> Fei
>
> On Mon, Jan 16, 2017 at 12:37 AM,  wrote:
>
>> Hi,
>>
>>
>>
>> Coalesce is used to decrease the number of partitions. If you give the
>> value of numPartitions greater than the current partition, I don’t think
>> RDD number of partitions will be increased.
>>
>>
>>
>> Thanks,
>>
>> Jasbir
>>
>>
>>
>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>> *Sent:* Sunday, January 15, 2017 10:10 PM
>> *To:* zouz...@cs.toronto.edu
>> *Cc:* user @spark ; dev@spark.apache.org
>> *Subject:* Re: Equally split a RDD partition into two partition at the
>> same node
>>
>>
>>
>> Hi Anastasios,
>>
>>
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>>
>>
>> Thanks,
>>
>> Fei
>>
>>
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
>> wrote:
>>
>> Hi Fei,
>>
>>
>>
>> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>>
>>
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/
>> main/scala/org/apache/spark/rdd/RDD.scala#L395
>> 
>>
>>
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>>
>>
>> Best,
>>
>> Anastasios
>>
>>
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>>
>> Dear all,
>>
>>
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>>
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>>
>>
>> Thanks in advance,
>>
>> Fei
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> -- Anastasios Zouzias
>>
>>
>>
>>
>
>


Re: spark support on windows

2017-01-16 Thread Steve Loughran

On 16 Jan 2017, at 11:06, Hyukjin Kwon wrote:

Hi,

I just looked through Jacek's page and I believe that is the correct way.

That seems to be a Hadoop-library-specific issue[1]. To my knowledge, winutils
and the binaries in the private repo are built by a Hadoop PMC member on a
dedicated Windows VM, which I believe makes them pretty trustworthy.

thank you :)

I also check out and build the specific git commit SHA-1 of the release, not any
(movable) tag, so my builds have sources identical to the matching releases.

This can be compiled from source. If you think it is not reliable or safe,
you can go and build it yourself.

I agree it would be great if there were documentation about this, as we have
only a weak promise for Windows[2], and I believe it always requires some
overhead to install Spark on Windows. FWIW, in the case of SparkR, there is
some documentation [3].

As for bundling it, it seems even Hadoop itself does not include this in its
releases. I think documentation would be enough.

Really, Hadoop itself should be doing the release of the Windows binaries. It's
just that it complicates the release process: the Linux build/test/release would
have to be done, then somehow the Windows stuff would need to be done on another
machine and mixed in. That's the real barrier: extra work. That said, maybe it's
time.




As for the many JIRAs, I am at least resolving them one by one.

I hope my answer is helpful and makes sense.

Thanks.


[1] https://wiki.apache.org/hadoop/WindowsProblems
[2] 
https://github.com/apache/spark/blob/f3a3fed76cb74ecd0f46031f337576ce60f54fb2/docs/index.md
[3] https://github.com/apache/spark/blob/master/R/WINDOWS.md


2017-01-16 19:35 GMT+09:00 assaf.mendelson:
Hi,
In the documentation it says spark is supported on windows.
The problem, however, is that the documentation description on windows is 
lacking. There are sources (such as 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
 and many more) which explain how to make spark run on windows, however, they 
all involve downloading a third party winutil.exe file.
Since this file is downloaded from a repository belonging to a private person, 
this can be an issue (e.g. getting approval to install on a company computer 
can be an issue).
There are tons of jira tickets on the subject (most are marked as duplicate or 
not a problem), however, I believe that if we say spark is supported on windows 
there should be a clear explanation on how to run it and one shouldn’t have to 
use executable from a private person.

If indeed using winutil.exe is the correct solution, I believe it should be 
bundled to the spark binary distribution along with clear instructions on how 
to add it.
Assaf.






Re: Both Spark AM and Client are trying to delete Staging Directory

2017-01-16 Thread Steve Loughran

On 16 Jan 2017, at 12:51, Rostyslav Sotnychenko wrote:

Thanks all!

I was using another DFS instead of HDFS, which was logging an error when 
fs.delete got called on non-existing path.


Really? Whose DFS, if you don't mind me asking? I'm surprised they log a
delete() of a missing path as an error, as it's not entirely uncommon for that
to happen during cleanup.

In Spark 2.0.1, which I was using previously, everything was working fine
because of an additional check that was made prior to deleting. However, that
check was removed in 2.1 (SPARK-16736, commit), so I started seeing an error
from my DFS.

It's not a problem in any way (i.e. it does not affect the Spark job at all), so
everything is fine. I just wanted to make sure it's not a Spark issue.


Thanks,
Rostyslav



No, not a problem. Really, that whole idea of having delete() return true/false
is pretty useless: nobody really knows what it means when it returns false. It
should just have been void, with an exception thrown if something actually
failed. That's essentially what they all do, though I've never seen any which
complains about this situation.

mkdirs(), now there's one to fear. Not even the java.io API clearly defines
what "false" coming back from there means, as it can mean both "there's a
directory there, so I didn't do any work" and "there's a file/symlink/mount
point/device there, which is probably a serious problem".


Re: spark support on windows

2017-01-16 Thread Steve Loughran

On 16 Jan 2017, at 10:35, assaf.mendelson wrote:

Hi,
In the documentation it says spark is supported on windows.
The problem, however, is that the documentation description on windows is 
lacking. There are sources (such as 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
 and many more) which explain how to make spark run on windows, however, they 
all involve downloading a third party winutil.exe file.
Since this file is downloaded from a repository belonging to a private person,

A repository belonging to me, ste...@apache.org

this can be an issue (e.g. getting approval to install on a company computer 
can be an issue).


As a committer on the Hadoop PMC, those signed artifacts are no less
trustworthy than anything you get from the ASF itself. It's a clean build off a
Windows VM that is only ever used for build/test of Hadoop code, no other use
at all; the VM is powered off most of its life. This actually makes it less of
a security risk than the main desktop. And you can check the GPG signature of
the artifacts to see they've not been tampered with.

There are tons of jira tickets on the subject (most are marked as duplicate or 
not a problem), however, I believe that if we say spark is supported on windows 
there should be a clear explanation on how to run it and one shouldn’t have to 
use executable from a private person.

While I recognise your concerns, if I wanted to run code on your machines, rest 
assured, I wouldn't do it in such an obvious way.

I'd do it via transitive Maven artifacts with a harmless name like
"org.example.xml-unit-diags" which would do something useful except in the
special case that it's running on code in your subnet, get a patch into a
pom.xml to pull it into org.apache.hadoop somewhere, release a version of
Hadoop with that dependency, then wait for it to propagate downstream into
everything, including all those server farms running Linux only.

Writing a malicious Windows native executable would require me to write C/C++
Windows code, and I don't want to go there.

Of course, if I did any of these I'd be in trouble when caught, lose my job, 
never be trusted to submit a line of code to any OSS project, lose all my 
friends, etc, etc. I have nothing to gain by doing so.

If you really don't trust me, the instructions for building it are up online:
set up a Windows system for compiling Hadoop, check out the branch and then run

  mvn -T 1C package -Pdist -Dmaven.javadoc.skip=true -DskipTests

Or go to hortonworks.com, download the Windows version and lift the Windows
binaries. Same thing, built by a colleague-managed release VM.


If indeed using winutil.exe is the correct solution, I believe it should be 
bundled to the spark binary distribution along with clear instructions on how 
to add it.
I recognise that it is good to question the provenance of every line of code
executed on machines you care about. I am reasonably confident as to the
quality of this code; given that it was a checkout and build of the ASF tagged
release, then signed by me, it would either need my VM corrupted, my VM's feed
from the ASF HTTPS repo subverted by a fake SSL cert, or someone getting hold
of my GPG key and GitHub keys and uploading something malicious in my name.
Interestingly, that is a vulnerability, one I covered last year in my
"Household infosec in a post-Sony era" talk:
https://www.youtube.com/watch?v=tcRjG1CCrPs

You'll be pleased to know that the relevant keys now live on a yubikey, so even 
malicious code executed on my desktop cannot get the secrets off the 
(encrypted) local drive. It'd need physical access to the key, and I'd notice 
it was missing, revoke everything, etc, etc, making the risk of my keys being 
stolen low. That leaves the general problem of "our entire build process is 
based on the assumption that we trust the Maven repositories and the people 
who wrote the JARs"

That's a far more serious problem than the provenance of a single exe file on 
github

-Steve


Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Pradeep,

That is a good idea. My customized RDDs are similar to NewHadoopRDD. If we have
billions of InputSplits, will that become a performance bottleneck? That is,
will too much data need to be transferred from the master node to the computing
nodes over the network?

Thanks,
Fei

On Mon, Jan 16, 2017 at 2:07 PM, Pradeep Gollakota 
wrote:

> Usually this kind of thing can be done at a lower level in the InputFormat
> usually by specifying the max split size. Have you looked into that
> possibility with your InputFormat?
>
> On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu  wrote:
>
>> Hi Jasbir,
>>
>> Yes, you are right. Do you have any idea about my question?
>>
>> Thanks,
>> Fei
>>
>> On Mon, Jan 16, 2017 at 12:37 AM,  wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> Coalesce is used to decrease the number of partitions. If you give the
>>> value of numPartitions greater than the current partition, I don’t think
>>> RDD number of partitions will be increased.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Jasbir
>>>
>>>
>>>
>>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>>> *Sent:* Sunday, January 15, 2017 10:10 PM
>>> *To:* zouz...@cs.toronto.edu
>>> *Cc:* user @spark ; dev@spark.apache.org
>>> *Subject:* Re: Equally split a RDD partition into two partition at the
>>> same node
>>>
>>>
>>>
>>> Hi Anastasios,
>>>
>>>
>>>
>>> Thanks for your reply. If I just increase the numPartitions to be twice
>>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false)
>>> keeps the data locality? Do I need to define my own Partitioner?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Fei
>>>
>>>
>>>
>>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
>>> wrote:
>>>
>>> Hi Fei,
>>>
>>>
>>>
>>> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>>>
>>>
>>>
>>> https://github.com/apache/spark/blob/branch-1.6/core/src/mai
>>> n/scala/org/apache/spark/rdd/RDD.scala#L395
>>> 
>>>
>>>
>>>
>>> coalesce is mostly used for reducing the number of partitions before
>>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>>> requirements) if you increase the # of partitions.
>>>
>>>
>>>
>>> Best,
>>>
>>> Anastasios
>>>
>>>
>>>
>>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>>>
>>> Dear all,
>>>
>>>
>>>
>>> I want to equally divide a RDD partition into two partitions. That
>>> means, the first half of elements in the partition will create a new
>>> partition, and the second half of elements in the partition will generate
>>> another new partition. But the two new partitions are required to be at the
>>> same node with their parent partition, which can help get high data
>>> locality.
>>>
>>>
>>>
>>> Is there anyone who knows how to implement it or any hints for it?
>>>
>>>
>>>
>>> Thanks in advance,
>>>
>>> Fei
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -- Anastasios Zouzias
>>>
>>>
>>>
>>>
>>
>>
>


About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Community,

I am struggling to save a DataFrame to a Hive table.

Versions:

Hive 1.2.1
Spark 2.0.1

*Working code:*

/* @Author: Chetan Khatri
   Description: This Scala script has been written for the HBase to Hive module,
   which reads a table from HBase and dumps it out to Hive. */

import it.nerdammer.spark.hbase._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.SparkSession

// Approach 1:
// Read HBase table
val hBaseRDD = sc.hbaseTable[(Option[String], Option[String], Option[String],
  Option[String], Option[String])]("university")
  .select("stid", "name", "subject", "grade", "city")
  .inColumnFamily("emp")

// Iterate HBaseRDD and generate RDD[Row]
val rowRDD = hBaseRDD.map(i => Row(i._1.get, i._2.get, i._3.get, i._4.get, i._5.get))

// Create sqlContext for the createDataFrame method
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create schema structure
object empSchema {
  val stid = StructField("stid", StringType)
  val name = StructField("name", StringType)
  val subject = StructField("subject", StringType)
  val grade = StructField("grade", StringType)
  val city = StructField("city", StringType)
  val struct = StructType(Array(stid, name, subject, grade, city))
}

import sqlContext.implicits._

// Create DataFrame with rowRDD and schema structure
val stdDf = sqlContext.createDataFrame(rowRDD, empSchema.struct)

// Importing Hive
import org.apache.spark.sql.hive

// Enable Hive with the Hive warehouse in SparkSession
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", "/usr/local/hive/warehouse/")
  .enableHiveSupport()
  .getOrCreate()

// Saving the DataFrame to the Hive table succeeds.
stdDf.write.mode("append").saveAsTable("employee")

// Approach 2: where the error comes from
import spark.implicits._
import spark.sql

sql("use default")
sql("create table employee(stid STRING, name STRING, subject STRING, grade STRING, city STRING)")

scala> sql("show TABLES").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| employee|      false|
+---------+-----------+

stdDf.write.mode("append").saveAsTable("employee")

ERROR Exception:
org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation default, employee is not supported.;
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:221)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:378)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
  ... 56 elided

Questions:

1. With Approach 1, data is stored even though the Hive table was not previously
created: when I call saveAsTable it creates the table automatically, and on the
next run it also appends data into it. How do I store data into a previously
created table?

2. It also gives the warning "WARN metastore.HiveMetaStore: Location:
file:/usr/local/spark/spark-warehouse/employee specified for non-external
table:employee", but I have already provided the path of the Hive metastore, so
why is it storing data in Spark's own warehouse metastore?
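
(Editorial aside, not from the original message: a commonly suggested way to append to a table that already exists in the Hive metastore is DataFrameWriter.insertInto instead of saveAsTable. A minimal sketch, reusing stdDf from the code above and assuming the column order matches the table definition:)

    // Append into the pre-created Hive table instead of re-creating it via saveAsTable.
    stdDf.write.mode("append").insertInto("employee")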

The Hive setup was done with reference to:
http://mitu.co.in/wp-content/uploads/2015/12/Hive-Installation-on-Ubuntu-14.04-and-Hadoop-2.6.3.pdf
and it is working well. I cannot change the Hive version; it must be 1.2.1.

Thank you.


Re: Why are ml models repartition(1)'d in save methods?

2017-01-16 Thread Asher Krim
Cool, thanks!

Jira: https://issues.apache.org/jira/browse/SPARK-19247
PR: https://github.com/apache/spark/pull/16607

I think the LDA model has the exact same issue: currently the
`topicsMatrix` (which is on the order of numWords*k, 4GB for numWords=3m and
k=1000) is saved as a single element in a case class. We should probably
address this in another issue.
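
(Editorial sketch, not from the thread and not the actual spark.ml internals: the kind of change being discussed is to persist one row per word instead of one huge datum, so the write is distributed. The map name and types below are assumptions for illustration.)

    import org.apache.spark.sql.SparkSession

    // wordVectors: the in-memory word -> vector map held by the model (assumed shape).
    def saveDistributed(spark: SparkSession,
                        wordVectors: Map[String, Array[Float]],
                        path: String): Unit = {
      import spark.implicits._
      spark.sparkContext.parallelize(wordVectors.toSeq)
        .toDF("word", "vector")   // one (word, vector) row per term
        .write.parquet(path)      // no repartition(1): partitions are written in parallel
    }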

On Fri, Jan 13, 2017 at 3:55 PM, Sean Owen  wrote:

> Yes, certainly debatable for word2vec. You have a good point that this
> could overrun the 2GB limit if the model is one big datum, for large but
> not crazy models. This model could probably easily be serialized as
> individual vectors in this case. It would introduce a
> backwards-compatibility issue but it's possible to read old and new
> formats, I believe.
>
> On Fri, Jan 13, 2017 at 8:16 PM Asher Krim  wrote:
>
>> I guess it depends on the definition of "small". A Word2vec model with
>> vectorSize=300 and vocabulary=3m takes nearly 4gb. While it does fit on a
>> single machine (so isn't really "big" data), I don't see the benefit in
>> having the model stored in one file. On the contrary, it seems that we
>> would want the model to be distributed:
>> * avoids shuffling of data to one executor
>> * allows the whole cluster to participate in saving the model
>> * avoids rpc issues (http://stackoverflow.com/questions/40842736/spark-
>> word2vecmodel-exceeds-max-rpc-size-for-saving)
>> * "feature parity" with mllib (issues with one large model file already
>> solved for mllib in SPARK-11994
>> )
>>
>>
>> On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath > > wrote:
>>
>> Yup - it's because almost all model data in spark ML (model coefficients)
>> is "small" - i.e. Non distributed.
>>
>> If you look at ALS you'll see there is no repartitioning since the factor
>> dataframes can be large
>> On Fri, 13 Jan 2017 at 19:42, Sean Owen  wrote:
>>
>> You're referring to code that serializes models, which are quite small.
>> For example a PCA model consists of a few principal component vector. It's
>> a Dataset of just one element being saved here. It's re-using the code path
>> normally used to save big data sets, to output 1 file with 1 thing as
>> Parquet.
>>
>> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim  wrote:
>>
>> But why is that beneficial? The data is supposedly quite large,
>> distributing it across many partitions/files would seem to make sense.
>>
>> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen  wrote:
>>
>> That is usually so the result comes out in one file, not partitioned over
>> n files.
>>
>> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim  wrote:
>>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ml models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.
>> parquet(dataPath)
>>
>> This shows up in most ml models I've seen (Word2Vec, PCA, LDA).
>> Am I missing some benefit of repartitioning like this?
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>
>>
>>
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>


Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Liang-Chi,

Yes, the split logic is needed in compute(). The preferred locations can be
derived from the customized Partition class.
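
(Editorial sketch, not code from the thread: one hedged way the approach described above can look. A custom RDD doubles the partitions through a NarrowDependency, computes one half of the parent partition per child, and reuses the parent's preferred locations so both halves stay on the same node. All class and field names are made up for illustration.)

    import scala.reflect.ClassTag
    import org.apache.spark.{NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // One child partition per half of a parent partition.
    class HalfPartition(override val index: Int, val parent: Partition, val firstHalf: Boolean)
      extends Partition

    class SplitRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](
        prev.context,
        Seq(new NarrowDependency[T](prev) {
          // Child partition i reads parent partition i / 2.
          override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
        })) {

      override def getPartitions: Array[Partition] =
        prev.partitions.flatMap { p =>
          Seq[Partition](new HalfPartition(2 * p.index, p, firstHalf = true),
                         new HalfPartition(2 * p.index + 1, p, firstHalf = false))
        }

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val hp = split.asInstanceOf[HalfPartition]
        // Materialize the parent partition once and hand out one half of it.
        val elems = firstParent[T].iterator(hp.parent, context).toArray
        val mid = elems.length / 2
        if (hp.firstHalf) elems.take(mid).iterator else elems.drop(mid).iterator
      }

      // Keep both halves on the same node as their parent partition.
      override def getPreferredLocations(split: Partition): Seq[String] =
        prev.preferredLocations(split.asInstanceOf[HalfPartition].parent)
    }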

Thanks for your help!

Cheers,
Fei


On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh  wrote:

>
> Hi Fei,
>
> I think it should work. But you may need to add some logic in compute() to
> decide which half of the parent partition to output. And you need to get the
> correct preferred locations for the partitions sharing the same parent
> partition.
>
>
> Fei Hu wrote
> > Hi Liang-Chi,
> >
> > Yes, you are right. I implement the following solution for this problem,
> > and it works. But I am not sure if it is efficient:
> >
> > I double the partitions of the parent RDD, and then use the new
> partitions
> > and parent RDD to construct the target RDD. In the compute() function of
> > the target RDD, I use the input partition to get the corresponding parent
> > partition, and get the half elements in the parent partitions as the
> > output
> > of the computing function.
> >
> > Thanks,
> > Fei
> >
> > On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh 
>
> > viirya@
>
> >  wrote:
> >
> >>
> >> Hi,
> >>
> >> When calling `coalesce` with `shuffle = false`, it is going to produce
> at
> >> most min(numPartitions, previous RDD's number of partitions). So I think
> >> it
> >> can't be used to double the number of partitions.
> >>
> >>
> >> Anastasios Zouzias wrote
> >> > Hi Fei,
> >> >
> >> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
> >> >
> >> > https://github.com/apache/spark/blob/branch-1.6/core/
> >> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> >> >
> >> > coalesce is mostly used for reducing the number of partitions before
> >> > writing to HDFS, but it might still be a narrow dependency (satisfying
> >> > your
> >> > requirements) if you increase the # of partitions.
> >> >
> >> > Best,
> >> > Anastasios
> >> >
> >> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu 
> >>
> >> > hufei68@
> >>
> >> >  wrote:
> >> >
> >> >> Dear all,
> >> >>
> >> >> I want to equally divide a RDD partition into two partitions. That
> >> means,
> >> >> the first half of elements in the partition will create a new
> >> partition,
> >> >> and the second half of elements in the partition will generate
> another
> >> >> new
> >> >> partition. But the two new partitions are required to be at the same
> >> node
> >> >> with their parent partition, which can help get high data locality.
> >> >>
> >> >> Is there anyone who knows how to implement it or any hints for it?
> >> >>
> >> >> Thanks in advance,
> >> >> Fei
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > -- Anastasios Zouzias
> >> > 
> >>
> >> > azo@.ibm
> >>
> >> > 
> >>
> >>
> >>
> >>
> >>
> >> -
> >> Liang-Chi Hsieh | @viirya
> >> Spark Technology Center
> >> http://www.spark.tc/
> >> --
> >> View this message in context: http://apache-spark-
> >> developers-list.1001551.n3.nabble.com/Equally-split-a-
> >> RDD-partition-into-two-partition-at-the-same-node-tp20597p20608.html
> >> Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >>
> >> -
> >> To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >>
> >>
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Equally-split-a-
> RDD-partition-into-two-partition-at-the-same-node-tp20597p20613.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Both Spark AM and Client are trying to delete Staging Directory

2017-01-16 Thread Rostyslav Sotnychenko
Thanks all!

I was using another DFS instead of HDFS, which was logging an error when
fs.delete got called on a non-existing path.
In Spark 2.0.1, which I was using previously, everything was working fine
because of an additional check that was made prior to deleting. However, that
check was removed in 2.1 (SPARK-16736, commit),
so I started seeing an error from my DFS.
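
(Editorial sketch, not the actual Spark code: roughly the shape of the pre-2.1 guard being described, i.e. check existence before deleting so a filesystem that complains about deleting a missing path stays quiet.)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Delete the staging directory only if it still exists; harmless either way.
    def cleanupStagingDir(stagingDir: String, conf: Configuration): Unit = {
      val stagingDirPath = new Path(stagingDir)
      val fs = stagingDirPath.getFileSystem(conf)
      if (fs.exists(stagingDirPath)) {
        fs.delete(stagingDirPath, true)  // recursive delete
      }
    }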

It's not a problem in any way (i.e. it does not affect the Spark job at all), so
everything is fine. I just wanted to make sure it's not a Spark issue.


Thanks,
Rostyslav

On Sun, Jan 15, 2017 at 3:19 PM, Liang-Chi Hsieh  wrote:

>
> Hi,
>
> Will it be a problem if the staging directory is already deleted? Because
> even the directory doesn't exist, fs.delete(stagingDirPath, true) won't
> cause failure but just return false.
>
>
> Rostyslav Sotnychenko wrote
> > Hi all!
> >
> > I am a bit confused why Spark AM and Client are both trying to delete
> > Staging Directory.
> >
> > https://github.com/apache/spark/blob/branch-2.1/yarn/
> src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1110
> > https://github.com/apache/spark/blob/branch-2.1/yarn/
> src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L233
> >
> > As you can see, in case if a job was running on YARN in Cluster
> deployment
> > mode, both AM and Client will try to delete Staging directory if job
> > succeeded and eventually one of them will fail to do this, because the
> > other one already deleted the directory.
> >
> > Shouldn't we add some check to Client?
> >
> >
> > Thanks,
> > Rostyslav
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Both-Spark-AM-and-
> Client-are-trying-to-delete-Staging-Directory-tp20588p20600.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: spark support on windows

2017-01-16 Thread Hyukjin Kwon
Hi,

I just looked through Jacek's page and I believe that is the correct way.

That seems to be a Hadoop-library-specific issue[1]. To my knowledge, winutils
and the binaries in the private repo are built by a Hadoop PMC member on a
dedicated Windows VM, which I believe makes them pretty trustworthy.
This can be compiled from source. If you think it is not reliable or safe,
you can go and build it yourself.
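
(Editorial aside, not part of the original message: a minimal driver-side sketch of the usual workaround once winutils.exe is in place. The C:\hadoop location is an assumption; winutils.exe would sit under C:\hadoop\bin, and setting the HADOOP_HOME environment variable to that directory works instead of the system property.)

    // Point the Hadoop libraries at the directory containing bin\winutils.exe
    // before any SparkSession/SparkContext is created.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")  // assumed location

    val spark = org.apache.spark.sql.SparkSession.builder()
      .master("local[*]")
      .appName("spark-on-windows")
      .getOrCreate()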

I agree it would be great if there were documentation about this, as we have
only a weak promise for Windows[2], and I believe it always requires some
overhead to install Spark on Windows. FWIW, in the case of SparkR, there is
some documentation [3].

As for bundling it, it seems even Hadoop itself does not include this in its
releases. I think documentation would be enough.

As for the many JIRAs, I am at least resolving them one by one.

I hope my answer is helpful and makes sense.

Thanks.


[1] https://wiki.apache.org/hadoop/WindowsProblems
[2]
https://github.com/apache/spark/blob/f3a3fed76cb74ecd0f46031f337576ce60f54fb2/docs/index.md
[3] https://github.com/apache/spark/blob/master/R/WINDOWS.md


2017-01-16 19:35 GMT+09:00 assaf.mendelson :

> Hi,
>
> In the documentation it says spark is supported on windows.
>
> The problem, however, is that the documentation description on windows is
> lacking. There are sources (such as https://jaceklaskowski.
> gitbooks.io/mastering-apache-spark/content/spark-tips-and-
> tricks-running-spark-windows.html and many more) which explain how to
> make spark run on windows, however, they all involve downloading a third
> party winutil.exe file.
>
> Since this file is downloaded from a repository belonging to a private
> person, this can be an issue (e.g. getting approval to install on a company
> computer can be an issue).
>
> There are tons of jira tickets on the subject (most are marked as
> duplicate or not a problem), however, I believe that if we say spark is
> supported on windows there should be a clear explanation on how to run it
> and one shouldn’t have to use executable from a private person.
>
>
>
> If indeed using winutil.exe is the correct solution, I believe it should
> be bundled to the spark binary distribution along with clear instructions
> on how to add it.
>
> Assaf.
>
> --
> View this message in context: spark support on windows
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


spark support on windows

2017-01-16 Thread assaf.mendelson
Hi,
In the documentation it says Spark is supported on Windows.
The problem, however, is that the documentation for Windows is lacking. There are
sources (such as
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
and many more) which explain how to make Spark run on Windows; however, they all
involve downloading a third-party winutil.exe file.
Since this file is downloaded from a repository belonging to a private person,
this can be an issue (e.g. getting approval to install it on a company computer
can be an issue).
There are tons of JIRA tickets on the subject (most are marked as duplicate or
not a problem); however, I believe that if we say Spark is supported on Windows,
there should be a clear explanation of how to run it, and one shouldn't have to
use an executable from a private person.

If indeed using winutil.exe is the correct solution, I believe it should be
bundled with the Spark binary distribution along with clear instructions on how
to add it.
Assaf.





Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Liang-Chi Hsieh

Hi Fei,

I think it should work. But you may need to add some logic in compute() to
decide which half of the parent partition to output. And you need to get the
correct preferred locations for the partitions sharing the same parent
partition.


Fei Hu wrote
> Hi Liang-Chi,
> 
> Yes, you are right. I implement the following solution for this problem,
> and it works. But I am not sure if it is efficient:
> 
> I double the partitions of the parent RDD, and then use the new partitions
> and parent RDD to construct the target RDD. In the compute() function of
> the target RDD, I use the input partition to get the corresponding parent
> partition, and get the half elements in the parent partitions as the
> output
> of the computing function.
> 
> Thanks,
> Fei
> 
> On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
>>
>> Hi,
>>
>> When calling `coalesce` with `shuffle = false`, it is going to produce at
>> most min(numPartitions, previous RDD's number of partitions). So I think
>> it
>> can't be used to double the number of partitions.
>>
>>
>> Anastasios Zouzias wrote
>> > Hi Fei,
>> >
>> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>> >
>> > https://github.com/apache/spark/blob/branch-1.6/core/
>> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>> >
>> > coalesce is mostly used for reducing the number of partitions before
>> > writing to HDFS, but it might still be a narrow dependency (satisfying
>> > your
>> > requirements) if you increase the # of partitions.
>> >
>> > Best,
>> > Anastasios
>> >
>> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu 
>>
>> > hufei68@
>>
>> >  wrote:
>> >
>> >> Dear all,
>> >>
>> >> I want to equally divide a RDD partition into two partitions. That
>> means,
>> >> the first half of elements in the partition will create a new
>> partition,
>> >> and the second half of elements in the partition will generate another
>> >> new
>> >> partition. But the two new partitions are required to be at the same
>> node
>> >> with their parent partition, which can help get high data locality.
>> >>
>> >> Is there anyone who knows how to implement it or any hints for it?
>> >>
>> >> Thanks in advance,
>> >> Fei
>> >>
>> >>
>> >
>> >
>> > --
>> > -- Anastasios Zouzias
>> > 
>>
>> > azo@.ibm
>>
>> > 
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Equally-split-a-
>> RDD-partition-into-two-partition-at-the-same-node-tp20597p20608.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 