[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-27 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630672#comment-16630672
 ] 

Mingjie Tang commented on SPARK-25501:
--

[~gsomogyi] Thanks so much for your hard work. Let me look at your SPIP and 
give some comments.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka released delegation token support in version 1.1. Since Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka






[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-26 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629320#comment-16629320
 ] 

Mingjie Tang commented on SPARK-25501:
--

[~gsomogyi] Thanks for your reply. 

First, the PR I proposed here is just a starting point for discussion; we can use it or 
disregard it, either way is fine with me. What I want to propose is that we 
move this ticket forward as soon as possible, since this feature is critical for production and 
the community. 

Second, you can write a document to discuss the design and have an SPIP. I can 
learn from your advice and that of others. This would be useful. 

Finally, thanks so much for starting work on this. Your example is 
very good. You can refer to my PR or implement it yourself; then we can 
discuss and move this forward ASAP. What do you think? I hope to 
learn from you. 

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka released delegation token support in version 1.1. Since Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka






[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-25 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628119#comment-16628119
 ] 

Mingjie Tang commented on SPARK-25501:
--

[~gsomogyi] If you like, you can propose a PR based on my PR as well: 
https://github.com/apache/spark/pull/22550

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka released delegation token support in version 1.1. Since Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka






[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-25 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627934#comment-16627934
 ] 

Mingjie Tang commented on SPARK-25501:
--

[~gsomogyi] Any updates on your PR? I can propose a PR as well. 

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka released delegation token support in version 1.1. Since Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka






[jira] [Created] (SPARK-25518) Spark kafka delegation token supported

2018-09-24 Thread Mingjie Tang (JIRA)
Mingjie Tang created SPARK-25518:


 Summary: Spark kafka delegation token supported 
 Key: SPARK-25518
 URL: https://issues.apache.org/jira/browse/SPARK-25518
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Mingjie Tang


As we can see, Kafka is going to support delegation 
tokens [https://cwiki-test.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka].
Spark needs to use delegation tokens for the Kafka cluster, just as it does 
for HDFS, Hive, and HBase servers. 

In this JIRA, we are going to track how to provide delegation tokens for 
Spark Streaming to read and write data from/to Kafka. 
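As a rough illustration of what the driver side could look like, the sketch below uses the Kafka 2.0 AdminClient API to obtain a delegation token; the broker address and security settings are placeholders, and this is only an assumption about one possible approach, not the eventual Spark implementation:

{code}
// Minimal sketch: fetch a Kafka delegation token on the driver, analogous to
// how HDFS/Hive/HBase tokens are obtained before an application starts.
import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")      // placeholder broker list
props.put("security.protocol", "SASL_SSL")          // token creation requires an authenticated channel
props.put("sasl.kerberos.service.name", "kafka")

val admin = AdminClient.create(props)
try {
  val token = admin.createDelegationToken().delegationToken().get()
  // The token (id + HMAC) would then be added to the driver's credentials and
  // distributed to executors together with the other delegation tokens.
  println(s"Obtained Kafka delegation token ${token.tokenInfo().tokenId()}")
} finally {
  admin.close()
}
{code}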






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-08-02 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567393#comment-16567393
 ] 

Mingjie Tang commented on SPARK-24615:
--

From the user's perspective, users only care about the GPU resources for an RDD and 
do not reason about the stages or partitions of the RDD. Therefore, the underlying 
resource allocation mechanism should assign the resources to executors 
automatically. 

Similar to caching or persisting at different levels, maybe we can provide 
different configurations to users. Resource allocation would then follow the 
predefined policy; a purely hypothetical sketch of such a configuration is shown below. 
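For illustration only, a policy-driven setup might look roughly like the following; none of these configuration keys existed in Spark at the time of writing, and they are purely hypothetical names used to convey the idea:

{code}
// Hypothetical sketch: illustrative keys only, not real Spark settings.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.accelerator.type", "gpu")              // hypothetical key
  .set("spark.executor.accelerator.amount", "2")              // hypothetical key
  .set("spark.task.accelerator.amount", "1")                  // hypothetical key
  .set("spark.accelerator.allocation.policy", "bin-packing")  // hypothetical policy name
{code}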

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark’s scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-06-21 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519622#comment-16519622
 ] 

Mingjie Tang commented on SPARK-24374:
--

[~mengxr] 

[~jerryshao] created an SPIP [SPARK-24615] for task scheduling for GPU 
resources. [~yanboliang]

 

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}






[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-06-11 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508814#comment-16508814
 ] 

Mingjie Tang commented on SPARK-24093:
--

The related topic could be viewed when users want to look at the Kafka 
streaming topic information. For example, a third-party job listener may want to 
track this streaming job and record the topic information. 

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and "topic" that streaming data are 
> written to, this JIRA is created to make the relevant fields of 
> KafkaStreamWriter and InternalRowMicroBatchWriter visible outside of the 
> classes.






[jira] [Created] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Mingjie Tang (JIRA)
Mingjie Tang created SPARK-24479:


 Summary: Register StreamingQueryListener in Spark Conf 
 Key: SPARK-24479
 URL: https://issues.apache.org/jira/browse/SPARK-24479
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Mingjie Tang


Users need to register their own StreamingQueryListener with the 
StreamingQueryManager; a similar mechanism is already provided by EXTRA_LISTENERS and 
QUERY_EXECUTION_LISTENERS. 

We propose to provide a STREAMING_QUERY_LISTENER conf so users can register their 
own listeners. 
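A minimal sketch of what users have to do today with the programmatic API, assuming a hypothetical AuditListener class; the proposal would let the same class be wired up through configuration at startup instead:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical example listener, for illustration only.
class AuditListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"progress: ${event.progress.numInputRows} input rows")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}")
}

val spark = SparkSession.builder().appName("listener-demo").getOrCreate()
// Today: registration must happen in application code.
spark.streams.addListener(new AuditListener)
// Proposed: something like a streaming-query-listeners conf entry naming the class.
{code}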






[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-05-29 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494154#comment-16494154
 ] 

Mingjie Tang commented on SPARK-24093:
--

I made a PR for this 
https://github.com/apache/spark/pull/21455


> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and "topic" that streaming data are 
> written to, this JIRA is created to make the relevant fields of 
> KafkaStreamWriter and InternalRowMicroBatchWriter visible outside of the 
> classes.






[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-05-25 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491285#comment-16491285
 ] 

Mingjie Tang commented on SPARK-24093:
--

I can add a PR for this. 

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and "topic" that streaming data are 
> written to, this JIRA is created to make the relevant fields of 
> KafkaStreamWriter and InternalRowMicroBatchWriter visible outside of the 
> classes.






[jira] [Created] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2018-03-13 Thread Mingjie Tang (JIRA)
Mingjie Tang created SPARK-23674:


 Summary: Add Spark ML Listener for Tracking ML Pipeline Status
 Key: SPARK-23674
 URL: https://issues.apache.org/jira/browse/SPARK-23674
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Mingjie Tang


Currently, Spark provides status monitoring for different components of Spark, 
such as the Spark history server, streaming listener, and SQL listener. The use 
cases would be (1) a front-end UI to track the training convergence rate during 
iterations, so data scientists can understand how the job converges when training 
K-means, logistic regression, and other linear models, and (2) tracking the data 
lineage for the input and output of training data. 

In this proposal, we hope to provide a Spark ML pipeline listener to track the 
status of a Spark ML pipeline (a rough interface sketch follows below), including: 
 # ML pipeline created and saved 
 # ML pipeline model created, saved, and loaded 
 # ML model training status monitoring 
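The following is a purely hypothetical sketch of what such a listener interface might look like; none of these trait or method names exist in Spark, they only illustrate the events the proposal would cover:

{code}
// Hypothetical interface sketch for the proposed ML pipeline listener.
trait MLPipelineListener {
  def onPipelineCreated(pipelineUid: String): Unit = {}
  def onPipelineSaved(pipelineUid: String, path: String): Unit = {}
  def onModelCreated(modelUid: String): Unit = {}
  def onModelSaved(modelUid: String, path: String): Unit = {}
  def onModelLoaded(modelUid: String, path: String): Unit = {}
  // Called per training iteration so a front-end UI can track convergence.
  def onTrainingProgress(modelUid: String, iteration: Int, objective: Double): Unit = {}
}
{code}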






[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2017-12-08 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284485#comment-16284485
 ] 

Mingjie Tang commented on SPARK-18673:
--

Hi Steve, the Hive side has been updated: 
https://github.com/apache/hive/blob/master/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144.
Thus, if we build against the new Hive, this unrecognized-version issue can be fixed. 


> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.






[jira] [Comment Edited] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url

2017-12-04 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273745#comment-16273745
 ] 

Mingjie Tang edited comment on SPARK-22587 at 12/5/17 12:01 AM:


We can update compareFs to also consider the URI authority: 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442

The PR has been sent out: 
https://github.com/apache/spark/pull/19885
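A minimal sketch of the idea, under the assumption that comparing scheme and authority is enough to distinguish wasb://XXX from wasb://YYY; this is not the actual patch, only an illustration:

{code}
import java.net.URI
import java.util.Locale

// Returns true only when both the scheme and the authority (account/container
// for WASB) match, so jars on a different account are copied instead of reused.
def sameFileSystem(src: URI, dst: URI): Boolean = {
  def norm(s: String): Option[String] = Option(s).map(_.toLowerCase(Locale.ROOT))
  norm(src.getScheme) == norm(dst.getScheme) &&
    norm(src.getAuthority) == norm(dst.getAuthority)
}
{code}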



was (Author: merlin):
we can update the compareFS by considering the authority. 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442

I would send out a PR soon. 

> Spark job fails if fs.defaultFS and application jar are different url
> -
>
> Key: SPARK-22587
> URL: https://issues.apache.org/jira/browse/SPARK-22587
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>
> A Spark job fails if fs.defaultFS and the URL where the application jar resides are 
> different but have the same scheme:
> spark-submit  --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> In core-site.xml, fs.defaultFS is set to wasb:///YYY. Hadoop listing (hadoop 
> fs -ls) works for both URLs, XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
> wasb://XXX/tmp/test.py, expected: wasb://YYY 
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) 
> at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
>  
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) 
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) 
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
>  
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) 
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
>  
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) 
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) 
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) 
> at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
>  
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) 
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) 
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
> {code}
> The code Client.copyFileToRemote tries to resolve the path of application jar 
> (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead 
> of the actual url of application jar.
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> getFileSystem will create the filesystem based on the URL of the path, so 
> this is fine. But the lines of code below try to get the srcPath (XXX URL) 
> from the destFs (YYY URL), so it fails.
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)






[jira] [Commented] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url

2017-11-30 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273745#comment-16273745
 ] 

Mingjie Tang commented on SPARK-22587:
--

We can update compareFs to also consider the URI authority: 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442

I would send out a PR soon. 

> Spark job fails if fs.defaultFS and application jar are different url
> -
>
> Key: SPARK-22587
> URL: https://issues.apache.org/jira/browse/SPARK-22587
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>
> A Spark job fails if fs.defaultFS and the URL where the application jar resides are 
> different but have the same scheme:
> spark-submit  --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> In core-site.xml, fs.defaultFS is set to wasb:///YYY. Hadoop listing (hadoop 
> fs -ls) works for both URLs, XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
> wasb://XXX/tmp/test.py, expected: wasb://YYY 
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) 
> at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
>  
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) 
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) 
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
>  
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) 
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
>  
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) 
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) 
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) 
> at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
>  
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) 
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) 
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
> {code}
> The code Client.copyFileToRemote tries to resolve the path of application jar 
> (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead 
> of the actual url of application jar.
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> getFileSystem will create the filesystem based on the URL of the path, so 
> this is fine. But the lines of code below try to get the srcPath (XXX URL) 
> from the destFs (YYY URL), so it fails.
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)






[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893117#comment-15893117
 ] 

Mingjie Tang commented on SPARK-19771:
--

(1) Because you need to explode each tuple. In the example mentioned above, for 
one input tuple you have to build 3 rows, and each hash value contains a vector 
whose length equals the number of hash functions. Thus, for one tuple the memory overhead is 
NumHashFunctions*NumHashTables=15, and if the number of input tuples is N, the 
overhead is NumHashFunctions*NumHashTables*N. 

(2) Yes, the hash value can be anything depending on your input bucket width W. Actually, 
it should be very large to reduce collisions.

(3) I am not sure hashCode can work, because we need to use this function 
for multi-probe searching. 

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.






[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-03-02 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893087#comment-15893087
 ] 

Mingjie Tang commented on SPARK-18454:
--

[~yunn] The current multi-probe NNS can be improved without building an index. 

 



> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation






[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-02-28 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180
 ] 

Mingjie Tang edited comment on SPARK-19771 at 3/1/17 12:30 AM:
---

If we follow the AND-OR framework, one optimization is possible here: 

First, suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) 
and setNumHashFunctions(5).

For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], 
its mapped hash vectors are
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])

For the similarity join, we map this computed vector into
+----------------+-----+--------------------+
|        datasetA|entry|           hashValue|
+----------------+-----+--------------------+
|[0,(6,[0,1,2],[1|    0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|    1|[-1.0,0.0,0.0,-1    |
|[0,(6,[0,1,2],[1|    2|[-1.0,-1.0,-1.0,-...|
+----------------+-----+--------------------+

Then, following the AND-OR principle, we use the entry and hashValue for an 
equi-join with the other table. Looking at the SQL, it needs to iterate 
through the hashValue vector for the equality join, so the computation and memory 
cost is very high. For example, if we apply a nested loop to compare 
two vectors of length NumHashFunctions, the computation cost is 
(NumHashFunctions * NumHashFunctions) and the memory overhead is (N * 
NumHashFunctions). 

Therefore, we can apply one optimization technique here: we do not 
need to store the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0]; 
instead, we use a simple hash function to compress it. 

Suppose h_l = {h_0, h_1, ..., h_k}, where k is the number set by 
setNumHashFunctions. 
Then the mapped hash function is 
 H(h_l) = sum_{i=0...k} (h_i*r_i) mod prime,
where each r_i is an integer and the prime can be set to 2^32-5 for fewer hash 
collisions.
Then we only need to store the hash value H(h_l) for the AND operation. 

A similar approach can also be applied to the approxNearestNeighbors 
search. 

If the multi-probe approach does not need to read the hash value of an individual hash 
function (e.g. h_l mentioned above), we can apply this H(h_l) to the 
preprocessed data to save memory. I will double-check the multi-probe approach 
and see whether it works. Then I can submit a patch to improve the current 
approach. Note that this hash-function trick to avoid vector-to-vector comparison is 
widely used in different places, for example in the LSH implementation by Andoni. 
 
[~yunn] [~mlnick] 
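A minimal sketch of the compression step described above, under the assumption that the per-table bucket ids are integral values stored as doubles (as in the LSH output); this is only an illustration, not a proposed patch:

{code}
// Collapse a compound hash vector g(key) = (h_0, ..., h_k) into a single value
// H(g) = sum_i (h_i * r_i) mod prime, so the equi-join compares one number per
// hash table instead of a length-k vector.
import scala.util.Random

val prime = BigInt(2).pow(32) - 5   // 2^32 - 5, as suggested above

// Fixed random coefficients r_i, one per hash function (seeded for reproducibility).
def randomCoefficients(numHashFunctions: Int, seed: Long = 42L): Array[Long] = {
  val rnd = new Random(seed)
  Array.fill(numHashFunctions)(rnd.nextInt(Int.MaxValue).toLong)
}

// BigInt arithmetic keeps the sum exact and non-negative before reducing mod prime.
def compressHashes(hashes: Array[Double], coeffs: Array[Long]): Long = {
  require(hashes.length == coeffs.length)
  hashes.zip(coeffs)
    .map { case (h, r) => BigInt(h.toLong) * BigInt(r) }
    .sum
    .mod(prime)
    .toLong
}
{code}

For the example above with 5 hash functions per table, each row would then carry three Longs (one per hash table) instead of three length-5 vectors.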



was (Author: merlintang):
If we follow the AND-OR framework, one optimization is here: 

At first, suppose we use BucketedRandomProjectionLSH, and  setNumHashTables(3) 
and setNumHashFunctions(5).

For one input vector with sparse format e.g., [(6,[0,1,2],[1.0,1.0,1.0])], 
we can compute its mapped hash vectors is
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])]

For the similarity-join, we map this computed vector into
++-++
|datasetA|entry|   hashValue|
++-++
|[0,(6,[0,1,2],[1|0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|1|[-1.0,0.0,0.0,-1|
|[0,(6,[0,1,2],[1|2|[-1.0,-1.0,-1.0,-...|

Then, based on AND-OR principle, we using the entry and hashValue for 
equip-join with other table . When we look at the the sql, it need to iterate 
through the hashValue vector for equal join. Thus, this computation and memory 
usage cost is very high. For example, if we apply the nest-loop for comparing 
two vectors with length of NumHashFunctions, the computation cost is 
(NumHashFunctions* NumHashFunctions) and memory overhead is (N* 
NumHashFunctions). 

Therefore, we can apply one optimization technique here. that is, we do not 
need to store the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0], 
instead, we use a simple hash function to improve this. 

Suppose h_l= {h_0, h_1., ... h_k}, where k is the number of 
setNumHashFunctions. 
then, the mapped hash function is 
 H(h_l)=sum_{i =0...k} (h_i*r_i) mod prime.
where r_i is a integer  and the prime can be set as 2^32-5 for less hash 
collision.
Then, we only need to store the hash value H(h_l) for AND operation.  

The similar approach also can be applied for the approxNeasetNeighbors 
searching. 

If the multi-probe approach does not need to read the hash value for one hash 
function (e.g., h_l mentioned above), we can applied this H(h_l) to the 
preprocessed data to save memory. I will double check the multi-probe approach 
and see whether it is work.  Then, I can submit a patch to improve the current 
way. Note, this hash function to reduce the vector-to-vector comparing is 
widely used in different places for example, the LSH implementation by Andoni . 
 
[~diefunction] [~mlnick] 


> Support OR-AND 

[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-02-28 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180
 ] 

Mingjie Tang commented on SPARK-19771:
--

If we follow the AND-OR framework, one optimization is possible here: 

First, suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) 
and setNumHashFunctions(5).

For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], 
its mapped hash vectors are
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])

For the similarity join, we map this computed vector into
+----------------+-----+--------------------+
|        datasetA|entry|           hashValue|
+----------------+-----+--------------------+
|[0,(6,[0,1,2],[1|    0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|    1|[-1.0,0.0,0.0,-1    |
|[0,(6,[0,1,2],[1|    2|[-1.0,-1.0,-1.0,-...|
+----------------+-----+--------------------+

Then, following the AND-OR principle, we use the entry and hashValue for an 
equi-join with the other table. Looking at the SQL, it needs to iterate 
through the hashValue vector for the equality join, so the computation and memory 
cost is very high. For example, if we apply a nested loop to compare 
two vectors of length NumHashFunctions, the computation cost is 
(NumHashFunctions * NumHashFunctions) and the memory overhead is (N * 
NumHashFunctions). 

Therefore, we can apply one optimization technique here: we do not 
need to store the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0]; 
instead, we use a simple hash function to compress it. 

Suppose h_l = {h_0, h_1, ..., h_k}, where k is the number set by 
setNumHashFunctions. 
Then the mapped hash function is 
 H(h_l) = sum_{i=0...k} (h_i*r_i) mod prime,
where each r_i is an integer and the prime can be set to 2^32-5 for fewer hash 
collisions.
Then we only need to store the hash value H(h_l) for the AND operation. 

A similar approach can also be applied to the approxNearestNeighbors 
search. 

If the multi-probe approach does not need to read the hash value of an individual hash 
function (e.g. h_l mentioned above), we can apply this H(h_l) to the 
preprocessed data to save memory. I will double-check the multi-probe approach 
and see whether it works. Then I can submit a patch to improve the current 
approach. Note that this hash-function trick to avoid vector-to-vector comparison is 
widely used in different places, for example in the LSH implementation by Andoni. 
 
[~diefunction] [~mlnick] 


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.






[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-22 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879539#comment-15879539
 ] 

Mingjie Tang commented on SPARK-18454:
--

[~yunn] I left some comments on the document. Building an index over the input 
data would be very useful if we do not shuffle the input data table. 

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation






[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867064#comment-15867064
 ] 

Mingjie Tang commented on SPARK-18392:
--

Sure, AND-amplification is important and fundamental for the current LSH. We can allocate 
effort to AND-amplification and new hash functions. \cc [~yunn] 

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]






[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867041#comment-15867041
 ] 

mingjie tang commented on SPARK-18392:
--

[~yunn] Are you working on the BitSampling & SignRandomProjection functions? If 
not, I can work on them this week. 
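For reference, a rough sketch of the sign-random-projection hash family mentioned above; this is not an existing Spark class, just an illustration of the technique (each hash bit is the sign of a dot product with a random Gaussian vector):

{code}
import scala.util.Random

// Draw numHashes random hyperplanes for input vectors of dimension dim.
def randomHyperplanes(dim: Int, numHashes: Int, seed: Long = 7L): Array[Array[Double]] = {
  val rnd = new Random(seed)
  Array.fill(numHashes)(Array.fill(dim)(rnd.nextGaussian()))
}

// Sign random projection: h_i(x) = 1 if <x, w_i> >= 0 else 0.
def signHashes(x: Array[Double], planes: Array[Array[Double]]): Array[Int] =
  planes.map { w =>
    val dot = x.zip(w).map { case (a, b) => a * b }.sum
    if (dot >= 0) 1 else 0
  }
{code}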

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]






[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-10 Thread mingjie tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mingjie tang updated SPARK-18372:
-
Fix Version/s: 2.0.2

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.2
>
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py






[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mingjie tang updated SPARK-18372:
-
Attachment: _thumb_37664.png

The staging directory fails to be removed when the Hive table is in HDFS. 

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py






[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649935#comment-15649935
 ] 

mingjie tang commented on SPARK-18372:
--

This bug can be reproduced with the following code: 

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' INTO TABLE T1")
sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)")
val sparktestdf = sqlContext.table("T1")
val dfw = sparktestdf.write
dfw.insertInto("T2")
val sparktestcopypydfdf = sqlContext.sql("""SELECT * from T2""")
sparktestcopypydfdf.show

After the user quits the spark-shell, the related .staging directory generated by 
the Hive writer is not removed. 
For example, for the Hive table in the directory /user/hive/warehouse/t2: 
drwxr-xrwx  3 root  wheel   102 Oct 28 15:02 
.hive-staging_hive_2016-10-28_15-02-43_288_7070526396398178792-1

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> a new .hive-staging folder was still created in the hive/warehouse/ folder
> moreover, if we run the same query in pure Hive via the Hive CLI on the same 
> Spark cluster, we do not see this behavior
> so it does not appear to be a Hive issue in this case - this is Spark behavior
> I checked in Ambari, and spark.yarn.preserve.staging.files=false is already set 
> in the Spark configuration
> The issue happens via spark-submit as well - the customer used the following 
> command to reproduce it:
> spark-submit test-hive-staging-cleanup.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649397#comment-15649397
 ] 

mingjie tang commented on SPARK-18372:
--

Solution: 
This bug was reported by customers.
The root cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive 
classes (under org.apache.hadoop.hive) to create the staging directory. By default, 
on the Hive side, that staging directory is removed once the Hive session expires. 
However, Spark fails to notify Hive to remove the staging files.
Thus, following the code in Spark 2.0.x, I added a function inside 
InsertIntoHiveTable to create the .hive-staging directory so that, once the Spark 
session ends, the .hive-staging directory is removed.
This update has been tested on Spark 1.5.2 and Spark 1.6.3, and the pull request 
is : 

For the test, I manually checked for leftover .hive-staging files in the table's 
directory after closing the spark-shell. 
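
Not the actual code from the pull request, just a minimal sketch of the approach 
described above, assuming the writer keeps track of the staging path it created 
(the stagingDir value below is a placeholder): register a cleanup action so the 
directory is deleted when the Spark process shuts down.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Placeholder for the staging directory created by the insert (illustrative only)
val stagingDir = new Path("/user/hive/warehouse/t2/.hive-staging_hive_example")
val hadoopConf = new Configuration()

// Delete the tracked staging directory when the JVM, and hence the Spark
// session, shuts down, mirroring what Hive does on session expiry.
sys.addShutdownHook {
  val fs = stagingDir.getFileSystem(hadoopConf)
  if (fs.exists(stagingDir)) {
    fs.delete(stagingDir, true) // recursive delete
  }
}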

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> a new .hive-staging folder was still created in the hive/warehouse/ folder
> moreover, if we run the same query in pure Hive via the Hive CLI on the same 
> Spark cluster, we do not see this behavior
> so it does not appear to be a Hive issue in this case - this is Spark behavior
> I checked in Ambari, and spark.yarn.preserve.staging.files=false is already set 
> in the Spark configuration
> The issue happens via spark-submit as well - the customer used the following 
> command to reproduce it:
> spark-submit test-hive-staging-cleanup.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mingjie tang updated SPARK-18372:
-
Description: 
Steps to reproduce:

1. Launch spark-shell 
2. Run the following scala code via Spark-Shell 
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
clientid string, querytime string, market string, deviceplatform string, 
devicemake string, devicemodel string, state string, country string, 
querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
hivesampletablecopypydfdf.show
3. in HDFS (in our case, WASB), we can see the following folders 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
the issue is that these don't get cleaned up and get accumulated
=
with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" 
in hive-site.xml - didn't make any difference.
.hive-staging folders are created under the  folder - 
hive/warehouse/hivesampletablecopypy/
we have tried adding this property to hive-site.xml and restart the components -
 
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>
a new .hive-staging folder was still created in the hive/warehouse/ folder
moreover, if we run the same query in pure Hive via the Hive CLI on the same Spark 
cluster, we do not see this behavior
so it does not appear to be a Hive issue in this case - this is Spark behavior
I checked in Ambari, and spark.yarn.preserve.staging.files=false is already set in 
the Spark configuration
The issue happens via spark-submit as well - the customer used the following 
command to reproduce it:
spark-submit test-hive-staging-cleanup.py


  was:
Steps to reproduce:

1. Launch spark-shell 
2. Run the following scala code via Spark-Shell 
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
clientid string, querytime string, market string, deviceplatform string, 
devicemake string, devicemodel string, state string, country string, 
querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
hivesampletablecopypydfdf.show
3. in HDFS (in our case, WASB), we can see the following folders 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
the issue is that these don't get cleaned up and get accumulated
=
with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" 
in hive-site.xml - didn't make any difference.
.hive-staging folders are created under the  folder - 
hive/warehouse/hivesampletablecopypy/
we have tried adding this property to hive-site.xml and restart the components -
 
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>
a new .hive-staging folder was still created in the hive/warehouse/ folder
moreover, if we run the same query in pure Hive via the Hive CLI on the same Spark 
cluster, we do not see this behavior
so it does not appear to be a Hive issue in this case - this is Spark behavior
I checked in Ambari, and spark.yarn.preserve.staging.files=false is already set in 
the Spark configuration
The issue happens via spark-submit as well - the customer used the following 
command to reproduce it:
spark-submit test-hive-staging-cleanup.py

Solution: 
This bug was reported by customers.
The root cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive 
classes (under org.apache.hadoop.hive) to create the staging directory. By default, 
on the Hive side, that staging directory is removed once the Hive session expires.

[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649386#comment-15649386
 ] 

mingjie tang commented on SPARK-18372:
--

the PR is https://github.com/apache/spark/pull/15819

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> a new .hive-staging folder was still created in the hive/warehouse/ folder
> moreover, if we run the same query in pure Hive via the Hive CLI on the same 
> Spark cluster, we do not see this behavior
> so it does not appear to be a Hive issue in this case - this is Spark behavior
> I checked in Ambari, and spark.yarn.preserve.staging.files=false is already set 
> in the Spark configuration
> The issue happens via spark-submit as well - the customer used the following 
> command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug was reported by customers.
> The root cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the 
> Hive classes (under org.apache.hadoop.hive) to create the staging directory. By 
> default, on the Hive side, that staging directory is removed once the Hive 
> session expires. However, Spark fails to notify Hive to remove the staging files.
> Thus, following the code in Spark 2.0.x, I added a function inside 
> InsertIntoHiveTable to create the .hive-staging directory so that, once the 
> Spark session ends, the .hive-staging directory is removed.
> This update has been tested on Spark 1.5.2 and Spark 1.6.3, and the pull 
> request is : 
> For the test, I manually checked for leftover .hive-staging files in the 
> table's directory after closing the spark-shell. Meanwhile, please advise how 
> to write a test case for this, since the directory for the related tables 
> cannot be obtained.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)
mingjie tang created SPARK-18372:


 Summary: .Hive-staging folders created from Spark hiveContext are 
not getting cleaned up
 Key: SPARK-18372
 URL: https://issues.apache.org/jira/browse/SPARK-18372
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2, 1.5.2, 1.6.3
 Environment: spark standalone and spark yarn 
Reporter: mingjie tang
 Fix For: 2.0.1


Steps to reproduce:

1. Launch spark-shell 
2. Run the following scala code via Spark-Shell 
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
clientid string, querytime string, market string, deviceplatform string, 
devicemake string, devicemodel string, state string, country string, 
querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
hivesampletablecopypydfdf.show
3. in HDFS (in our case, WASB), we can see the following folders 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
the issue is that these don't get cleaned up and get accumulated
=
with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" 
in hive-site.xml - didn't make any difference.
.hive-staging folders are created under the  folder - 
hive/warehouse/hivesampletablecopypy/
we have tried adding this property to hive-site.xml and restart the components -
 
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>
a new .hive-staging folder was still created in the hive/warehouse/ folder
moreover, if we run the same query in pure Hive via the Hive CLI on the same Spark 
cluster, we do not see this behavior
so it does not appear to be a Hive issue in this case - this is Spark behavior
I checked in Ambari, and spark.yarn.preserve.staging.files=false is already set in 
the Spark configuration
The issue happens via spark-submit as well - the customer used the following 
command to reproduce it:
spark-submit test-hive-staging-cleanup.py

Solution: 
This bug was reported by customers.
The root cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive 
classes (under org.apache.hadoop.hive) to create the staging directory. By default, 
on the Hive side, that staging directory is removed once the Hive session expires. 
However, Spark fails to notify Hive to remove the staging files.
Thus, following the code in Spark 2.0.x, I added a function inside 
InsertIntoHiveTable to create the .hive-staging directory so that, once the Spark 
session ends, the .hive-staging directory is removed.
This update has been tested on Spark 1.5.2 and Spark 1.6.3, and the pull request 
is : 

For the test, I manually checked for leftover .hive-staging files in the table's 
directory after closing the spark-shell. Meanwhile, please advise how to write a 
test case for this, since the directory for the related tables cannot be obtained.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org