[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630672#comment-16630672 ] Mingjie Tang commented on SPARK-25501:

[~gsomogyi] Thanks so much for your hard work. Let me look at your SPIP and give some comments.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Gabor Somogyi
> Priority: Major
>
> Delegation token support was released in Kafka version 1.1. Since Spark has updated its Kafka client to 2.0.0, it is now possible to implement delegation token support. Please see the description:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629320#comment-16629320 ] Mingjie Tang commented on SPARK-25501:

[~gsomogyi] Thanks for your reply. First, the PR I proposed here is for discussion; we can use it or disregard it, either way is fine with me. What I want to propose is that we move this ticket forward ASAP, since this feature is critical for production and the community. Second, you can write a document to discuss the design and start a SPIP; I can learn from the advice you and others give there, which would be useful. Finally, thanks so much for starting on this work; your example is very good. You can refer to my PR or do it yourself, and then we can discuss and move this forward ASAP. What do you think? I hope to learn from you.
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628119#comment-16628119 ] Mingjie Tang commented on SPARK-25501:

[~gsomogyi] If you like, you can propose a PR based on my PR as well: https://github.com/apache/spark/pull/22550
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627934#comment-16627934 ] Mingjie Tang commented on SPARK-25501:

[~gsomogyi] Any updates on your PR? I can propose a PR as well.
[jira] [Created] (SPARK-25518) Spark kafka delegation token supported
Mingjie Tang created SPARK-25518:

Summary: Spark kafka delegation token supported
Key: SPARK-25518
URL: https://issues.apache.org/jira/browse/SPARK-25518
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Mingjie Tang

As we can see, Kafka supports delegation tokens [https://cwiki-test.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka]. Spark needs to obtain a delegation token for a Kafka cluster just as it obtains delegation tokens for HDFS, Hive, and HBase. In this JIRA, we track how to provide delegation tokens so Spark Streaming can read and write data from/to Kafka.
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567393#comment-16567393 ] Mingjie Tang commented on SPARK-24615:

From the user's perspective, users only care about the GPU resources for an RDD and do not understand the stages or partitions of the RDD. Therefore, the underlying resource allocation mechanism should assign resources to executors automatically. Similar to caching or persisting at different levels, maybe we can provide different configurations to users; resource allocation would then follow the predefined policy.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Saisai Shao
> Assignee: Saisai Shao
> Priority: Major
> Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant compared to CPUs. To make the current Spark architecture work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule tasks onto the executors where accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data plus the availability of CPUs. This introduces some problems when scheduling tasks that require accelerators:
> # CPU cores usually outnumber accelerators on one node, so using CPU cores to schedule accelerator-required tasks introduces a mismatch.
> # In one cluster, we always assume that CPUs are equipped on each node, but this is not true of accelerator cards.
> # The existence of heterogeneous tasks (accelerator required or not) requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous tasks (accelerator required or not). This can be part of the work of Project Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation details, just highlights the parts that should be changed.
>
> CC [~yanboliang] [~merlintang]
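The matching problem described in the proposal (CPU-based scheduling cannot express GPU demand) can be sketched in a few lines. This is a hypothetical illustration, not Spark code: the names `TaskRequest`, `ExecutorOffer`, and `can_schedule` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskRequest:
    """Hypothetical per-task resource demand: GPUs alongside CPU cores."""
    cpus: int = 1
    gpus: int = 0


@dataclass
class ExecutorOffer:
    """Hypothetical free capacity reported by one executor."""
    host: str
    free_cpus: int
    free_gpus: int


def can_schedule(task: TaskRequest, offer: ExecutorOffer) -> bool:
    # Match on both dimensions: scheduling by CPU cores alone would happily
    # place a GPU task on a GPU-less executor -- the mismatch described above.
    return offer.free_cpus >= task.cpus and offer.free_gpus >= task.gpus
```

A scheduler that only checked `free_cpus` would accept the second offer below even though the node has no accelerators.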
[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519622#comment-16519622 ] Mingjie Tang commented on SPARK-24374:

[~mengxr] [~jerryshao] created a SPIP [SPARK-24615] for task scheduling for GPU resources. [~yanboliang]

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
> Issue Type: Epic
> Components: ML, Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from the MapReduce model used by Spark. In Spark, a task in a stage doesn't depend on any other tasks in the same stage, and hence it can be scheduled independently. In MPI, all workers start at the same time and pass messages around. To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named "barrier scheduling", which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. Spark can also provide an extra layer of fault tolerance in case some tasks fail in the middle, where Spark would abort all tasks and restart the stage.
> {quote}
[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
[ https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508814#comment-16508814 ] Mingjie Tang commented on SPARK-24093:

The related topic needs to be visible when users want to look at the Kafka streaming topic information. For example, a third-party job listener may want to track the streaming job and record the topic information.

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
> Issue Type: Wish
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Weiqing Yang
> Priority: Minor
>
> To let third parties get information about the streaming writer, for example the "writer" and the "topic" that streaming data are written to, this JIRA is created to make the relevant fields of KafkaStreamWriter and InternalRowMicroBatchWriter visible outside the classes.
[jira] [Created] (SPARK-24479) Register StreamingQueryListener in Spark Conf
Mingjie Tang created SPARK-24479:

Summary: Register StreamingQueryListener in Spark Conf
Key: SPARK-24479
URL: https://issues.apache.org/jira/browse/SPARK-24479
Project: Spark
Issue Type: New Feature
Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Mingjie Tang

Users need to register their own StreamingQueryListener with the StreamingQueryManager; similar functionality is already provided as EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS. We propose to provide a STREAMING_QUERY_LISTENER conf so users can register their own listeners.
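As a rough illustration of the proposed mechanism, the sketch below mimics how Spark resolves EXTRA_LISTENERS / QUERY_EXECUTION_LISTENERS: read a comma-separated list of class names from configuration and instantiate each one via reflection. The conf key and function name here are assumptions for illustration, not the actual Spark API.

```python
import importlib

# Hypothetical conf key, modeled on spark.extraListeners; not an actual name.
LISTENER_KEY = "spark.sql.streaming.streamingQueryListeners"


def load_listeners(conf: dict, key: str = LISTENER_KEY) -> list:
    """Instantiate every listener class named in a comma-separated conf value,
    the way Spark handles its existing *_LISTENERS conf entries."""
    listeners = []
    for class_name in filter(None, (s.strip() for s in conf.get(key, "").split(","))):
        module_name, _, cls_name = class_name.rpartition(".")
        cls = getattr(importlib.import_module(module_name), cls_name)
        listeners.append(cls())  # Spark likewise assumes a zero-arg constructor
    return listeners
```

With such a key set, every query started through the session would see the listeners without any explicit `addListener` call in user code.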
[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
[ https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494154#comment-16494154 ] Mingjie Tang commented on SPARK-24093:

I made a PR for this: https://github.com/apache/spark/pull/21455
[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
[ https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491285#comment-16491285 ] Mingjie Tang commented on SPARK-24093:

I can add a PR for this.
[jira] [Created] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status
Mingjie Tang created SPARK-23674:

Summary: Add Spark ML Listener for Tracking ML Pipeline Status
Key: SPARK-23674
URL: https://issues.apache.org/jira/browse/SPARK-23674
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.3.0
Reporter: Mingjie Tang

Currently, Spark provides status monitoring for its different components, such as the Spark history server, streaming listener, and SQL listener. The use cases here would be (1) a front-end UI tracking the training convergence rate across iterations, so data scientists can understand how a job converges during training (e.g. K-means, logistic regression, and other linear models), and (2) tracking the data lineage of the training data's inputs and outputs. In this proposal, we hope to provide a Spark ML pipeline listener to track ML pipeline status, including:
# ML pipeline created and saved
# ML pipeline model created, saved, and loaded
# ML model training status monitoring
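A minimal sketch of what such a listener could look like: a callback interface plus a simple listener bus that dispatches training events. The class and callback names below are hypothetical, not an existing Spark API.

```python
class MLPipelineListener:
    """Hypothetical callback interface; the event names are illustrative."""

    def on_pipeline_saved(self, path):
        pass

    def on_model_saved(self, path):
        pass

    def on_iteration(self, iteration, loss):
        pass


class ListenerBus:
    """Dispatches ML pipeline events to every registered listener."""

    def __init__(self):
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def post_iteration(self, iteration, loss):
        # Called once per training iteration (e.g. each K-means or logistic
        # regression step) so a front-end UI can plot the convergence rate.
        for listener in self._listeners:
            listener.on_iteration(iteration, loss)
```

A UI component would subclass `MLPipelineListener`, override the callbacks it cares about, and register itself on the bus.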
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284485#comment-16284485 ] Mingjie Tang commented on SPARK-18673:

Hi Steve, the Hive side has been updated: https://github.com/apache/hive/blob/master/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144. Thus, if we build against the new Hive, this unrecognized-version issue can be fixed.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT
> Reporter: Steve Loughran
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own Hive 1.2.x JAR, it will need to be updated to match.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273745#comment-16273745 ] Mingjie Tang edited comment on SPARK-22587 at 12/5/17 12:01 AM:

We can update compareFS by considering the authority:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442
The PR is sent out: https://github.com/apache/spark/pull/19885

> Spark job fails if fs.defaultFS and application jar are different url
> -
>
> Key: SPARK-22587
> URL: https://issues.apache.org/jira/browse/SPARK-22587
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 1.6.3
> Reporter: Prabhu Joseph
>
> A Spark job fails if fs.defaultFS and the URL where the application jar resides are different but have the same scheme:
> spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> In core-site.xml, fs.defaultFS is set to wasb:///YYY. Hadoop listing (hadoop fs -ls) works for both URLs, XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: wasb://XXX/tmp/test.py, expected: wasb://YYY
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
> at org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485)
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396)
> at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
> at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660)
> at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> The code in Client.copyFileToRemote tries to resolve the path of the application jar (XXX) from the FileSystem object created using the fs.defaultFS URL (YYY) instead of the actual URL of the application jar.
> {code}
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> {code}
> getFileSystem creates the filesystem based on the URL of the path, so this is fine. But the lines below try to qualify the srcPath (XXX URL) against destFs (YYY URL), and so it fails:
> {code}
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)
> {code}
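The fix direction described in the comments (compare the authority, not just the scheme) can be sketched as follows. This is an illustrative Python analogue of the `compareFs` check in `Client.scala`, not the actual Scala implementation (which additionally resolves canonical host names and ports).

```python
from urllib.parse import urlparse


def same_filesystem(src_uri: str, dest_uri: str) -> bool:
    """Two URIs name the same filesystem only when scheme AND authority match,
    so a wasb://XXX/... jar is not qualified against a wasb://YYY default FS."""
    src, dest = urlparse(src_uri), urlparse(dest_uri)
    if src.scheme != dest.scheme:
        return False
    # Comparing the authority (host[:port]) is the point of the fix; checking
    # the scheme alone is what lets the Wrong FS error above slip through.
    return (src.netloc or "").lower() == (dest.netloc or "").lower()
```

With such a check, the submit path would fall back to copying the jar to the destination filesystem instead of qualifying a foreign path against it.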
[jira] [Commented] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273745#comment-16273745 ] Mingjie Tang commented on SPARK-22587:

We can update compareFS by considering the authority:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442
I will send out a PR soon.
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893117#comment-15893117 ] Mingjie Tang commented on SPARK-19771:

(1) Because you need to explode each tuple. In the example mentioned above, for one input tuple you have to build 3 rows, and each hash value contains a vector whose length is the number of hash functions; thus, for one tuple, the memory overhead is NumHashFunctions*NumHashTables = 15. If the number of input tuples is N, the overhead is NumHashFunctions*NumHashTables*N.
(2) Yes, the hash value can be anything, depending on your input bucket width W. In practice it should be very large to reduce collisions.
(3) I am not sure hashCode can work, because we need to use this function for multi-probe searching.

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix approxNearestNeighbor and approxSimilarityJoin internally.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893087#comment-15893087 ] Mingjie Tang commented on SPARK-18454:

[~yunn] The current multi-probe NN search can be improved without building an index.

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Yun Ni
>
> We all agree to make the following improvement to multi-probe NN search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing a full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or probing sequence) we should use for {{MinHash}}
> (2) What the issues are and how we should change the current nearest neighbor implementation
[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180 ] Mingjie Tang edited comment on SPARK-19771 at 3/1/17 12:30 AM:

If we follow the AND-OR framework, one optimization is possible.

First, suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and setNumHashFunctions(5). For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], its mapped hash vectors are WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0]).

For the similarity join, we map this computed vector into:

| datasetA | entry | hashValue |
| [0,(6,[0,1,2],[1... | 0 | [-1.0,-1.0,-1.0,0... |
| [0,(6,[0,1,2],[1... | 1 | [-1.0,0.0,0.0,-1... |
| [0,(6,[0,1,2],[1... | 2 | [-1.0,-1.0,-1.0,-... |

Then, based on the AND-OR principle, we use entry and hashValue for an equi-join with the other table. Looking at the SQL, it has to iterate through the hashValue vector for the equality join, so the computation and memory cost is very high. For example, if we apply a nested loop to compare two vectors of length NumHashFunctions, the computation cost is NumHashFunctions*NumHashFunctions and the memory overhead is N*NumHashFunctions.

Therefore, we can apply one optimization here: we do not need to store the hash value vector for each hash table (like [-1.0,-1.0,-1.0,0.0,0.0]); instead, we use a simple hash function. Suppose h_l = {h_0, h_1, ..., h_k}, where k is the number set by setNumHashFunctions. Then the mapped hash function is H(h_l) = sum_{i=0..k} (h_i * r_i) mod prime, where each r_i is an integer and the prime can be set to 2^32 - 5 for fewer hash collisions. We then only need to store the single hash value H(h_l) for the AND operation.

A similar approach can also be applied to approxNearestNeighbors search. If the multi-probe approach does not need to read the hash value of an individual hash function (e.g. the h_l mentioned above), we can apply H(h_l) to the preprocessed data to save memory. I will double-check the multi-probe approach to see whether this works, and then I can submit a patch to improve the current implementation.

Note that using a hash function to reduce vector-to-vector comparison is widely used, for example in the LSH implementation by Andoni.

[~yunn] [~mlnick]
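The proposed compression H(h_l) = sum_i (h_i * r_i) mod prime can be sketched in a few lines; the function name and coefficient handling below are illustrative assumptions, not the patch itself.

```python
# Large prime from the comment (2^32 - 5) to keep collisions improbable.
PRIME = 2**32 - 5


def compress_hash_vector(hash_values, coefficients, prime=PRIME):
    """H(h_l) = sum_i h_i * r_i mod prime: collapse one hash table's vector of
    NumHashFunctions values into a single integer, so the AND step compares
    one number per table instead of iterating over whole vectors."""
    assert len(hash_values) == len(coefficients)
    return sum(int(h) * r for h, r in zip(hash_values, coefficients)) % prime
```

Equal hash vectors always map to the same integer, so the equi-join never misses a true match; distinct vectors may occasionally collide, which the large prime keeps rare.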
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889180#comment-15889180 ] Mingjie Tang commented on SPARK-19771: -- If we follow the AND-OR framework, one optimization is possible. First, suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) and setNumHashFunctions(5). For one input vector in sparse format, e.g., [(6,[0,1,2],[1.0,1.0,1.0])], its mapped hash vectors are WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0]). For the similarity join, this computed vector is mapped into
+----------------+-----+--------------------+
|        datasetA|entry|           hashValue|
+----------------+-----+--------------------+
|[0,(6,[0,1,2],[1|    0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1|    1|    [-1.0,0.0,0.0,-1|
|[0,(6,[0,1,2],[1|    2|[-1.0,-1.0,-1.0,-...|
+----------------+-----+--------------------+
Then, based on the AND-OR principle, we use the entry and hashValue columns for an equi-join with the other table. Looking at the SQL, it has to iterate through the hashValue vector for the equality join, so the computation and memory cost is very high. For example, if we apply a nested loop to compare two vectors of length NumHashFunctions, the computation cost is (NumHashFunctions * NumHashFunctions) and the memory overhead is (N * NumHashFunctions). Therefore, we can apply one optimization technique here: we do not need to store the hash value of each hash table, such as [-1.0,-1.0,-1.0,0.0,0.0]; instead, we use a simple hash function. Suppose h_l = (h_0, h_1, ..., h_k), where k is the number set by setNumHashFunctions; then the mapped hash function is H(h_l) = sum_{i=0...k} (h_i * r_i) mod prime, where r_i is a random integer and the prime can be set to 2^32 - 5 for fewer hash collisions. Then we only need to store the single hash value H(h_l) for the AND operation. A similar approach can also be applied to the approxNearestNeighbors search. 
If the multi-probe approach does not need to read the hash value of an individual hash function (e.g., h_l mentioned above), we can apply this H(h_l) to the preprocessed data to save memory. I will double-check the multi-probe approach and see whether it works; then I can submit a patch to improve the current implementation. Note that using a hash function to reduce vector-to-vector comparison is widely used, for example in the LSH implementation by Andoni. [~diefunction] [~mlnick] > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we go to implementation: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
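The compound-hash trick proposed in the comment above can be sketched in a few lines. This is an illustrative Python sketch, not Spark code; the names compound_hash and rand_coeffs, the fixed seed, and the reuse of the toy hash vectors from the example are assumptions made here for demonstration.

```python
import random

PRIME = 2**32 - 5  # prime suggested in the comment to reduce collisions

def compound_hash(hash_values, rand_coeffs, prime=PRIME):
    """Collapse the K hash values of one hash table into a single integer:
    H(h_l) = sum_i (h_i * r_i) mod prime."""
    return sum(int(h) * r for h, r in zip(hash_values, rand_coeffs)) % prime

# K = 5 hash functions per table, as in setNumHashFunctions(5)
rng = random.Random(42)
rand_coeffs = [rng.randrange(1, PRIME) for _ in range(5)]

# Two of the hash vectors from the example above
g0 = [-1.0, -1.0, -1.0, 0.0, 0.0]
g1 = [-1.0, 0.0, 0.0, -1.0, -1.0]

# Equal vectors collapse to equal integers; distinct vectors collide
# only with probability about 1/prime.
assert compound_hash(g0, rand_coeffs) == compound_hash(list(g0), rand_coeffs)
assert compound_hash(g0, rand_coeffs) != compound_hash(g1, rand_coeffs)
```

Equality of the single integer H(h_l) then stands in for equality of the length-K vector in the equi-join, so each candidate pair is compared on one integer instead of K floating-point entries.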
[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879539#comment-15879539 ] Mingjie Tang commented on SPARK-18454: -- [~yunn] I left some comments on the document. Building an index over the input data would be very useful if we do not shuffle the input table. > Changes to improve Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
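The approxQuantile improvement agreed on in the issue description (estimate the hashDistance threshold instead of fully sorting all distances) can be illustrated with a simple sampling sketch. Spark's real approxQuantile uses the Greenwald-Khanna algorithm; the function name approx_quantile_threshold and the sampling approach below are stand-ins for illustration only.

```python
import random

def approx_quantile_threshold(distances, q, sample_size=1000, seed=0):
    """Estimate the q-quantile of hash distances from a random sample,
    avoiding a full sort of the whole dataset (illustrative stand-in for
    Spark's approxQuantile, which uses Greenwald-Khanna)."""
    rng = random.Random(seed)
    sample = list(distances) if len(distances) <= sample_size else rng.sample(distances, sample_size)
    sample.sort()
    idx = min(int(q * len(sample)), len(sample) - 1)
    return sample[idx]

# Toy distances: keep only candidates below the estimated 1st-percentile threshold.
distances = [d / 100.0 for d in range(10000)]
threshold = approx_quantile_threshold(distances, q=0.01)
candidates = [d for d in distances if d <= threshold]
```

Only the sample is sorted, so the cost is O(sample_size log sample_size) rather than a sort of the whole dataset, at the price of an approximate threshold.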
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867064#comment-15867064 ] Mingjie Tang commented on SPARK-18392: -- Sure, AND-amplification is important and fundamental for the current LSH. We can allocate the effort between AND-amplification and new hash functions. \cc [~yunn] > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. 
> ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
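The terminology in the description above (K base hash functions h_i forming one compound function g via AND-amplification, and L compound functions forming the hash table H via OR-amplification) can be sketched as a toy example. This is an illustrative Python sketch with sign-of-random-projection base hash functions; the names g and is_candidate and the dimensions are assumptions for demonstration, not Spark's API.

```python
import random

# L hash tables, each with K base hash functions over DIM-dimensional vectors.
rng = random.Random(7)
DIM, K, L = 4, 3, 2
planes = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]
          for _ in range(L)]

def g(vec, table):
    """Compound hash value g(key): a tuple of K sign bits.
    Two keys collide in this table only if ALL K bits match (AND)."""
    return tuple(int(sum(w * x for w, x in zip(plane, vec)) >= 0)
                 for plane in planes[table])

def is_candidate(a, b):
    """OR-amplification across the L tables: the pair is a candidate
    if ANY compound hash value collides."""
    return any(g(a, j) == g(b, j) for j in range(L))

# Identical vectors always collide in every table.
assert is_candidate([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

Raising K makes each table stricter (fewer false positives); raising L recovers recall by giving a pair L chances to collide.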
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867041#comment-15867041 ] mingjie tang commented on SPARK-18392: -- [~yunn] Are you working on the BitSampling & SignRandomProjection functions? If not, I can work on them this week. > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. 
> ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mingjie tang updated SPARK-18372: - Fix Version/s: 2.0.2 > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.2 > > Attachments: _thumb_37664.png > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mingjie tang updated SPARK-18372: - Attachment: _thumb_37664.png The staging directory fails to be removed when the hive table is in HDFS. > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > Attachments: _thumb_37664.png > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649935#comment-15649935 ] mingjie tang commented on SPARK-18372: -- This bug can be reproduced with the following code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' INTO TABLE T1")
sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)")
val sparktestdf = sqlContext.table("T1")
val dfw = sparktestdf.write
dfw.insertInto("T2")
val sparktestcopypydfdf = sqlContext.sql("""SELECT * from T2 """)
sparktestcopypydfdf.show
After the user quits the spark-shell, the .hive-staging directory generated by the Hive writer is not removed. For example, for the hive table in the directory /user/hive/warehouse/t2:
drwxr-xrwx 3 root wheel 102 Oct 28 15:02 .hive-staging_hive_2016-10-28_15-02-43_288_7070526396398178792-1
> .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. 
Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. 
> .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649397#comment-15649397 ] mingjie tang commented on SPARK-18372: -- Solution: This bug was reported by customers. The root cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive classes (org.apache.hadoop.hive.) to create the staging directory. By default, on the Hive side, this staging directory is removed after the Hive session expires. However, Spark fails to notify Hive to remove the staging files. Thus, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable that creates the .staging directory itself, so that after the Spark session ends, this .staging directory is removed. This update is tested on Spark 1.5.2 and Spark 1.6.3, and the pull request is : For the test, I manually checked for leftover .staging files under the table's directory after the spark-shell closed. > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. 
Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. 
> .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
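The fix described in the Solution comment amounts to having Spark register the staging directory for removal when its own session ends, instead of relying on Hive's session expiry. The pattern can be sketched as a minimal Python stand-in; all names here, such as StagingDirManager, are hypothetical illustrations, not Spark or Hive APIs.

```python
import atexit
import os
import shutil
import tempfile

class StagingDirManager:
    """Illustrative stand-in for the fix: create the .hive-staging directory
    through a helper that registers it for removal when the session ends."""
    def __init__(self):
        self._staging_dirs = []
        atexit.register(self.cleanup)  # mimic "remove on session shutdown"

    def create_staging_dir(self, warehouse_dir, prefix=".hive-staging_"):
        path = tempfile.mkdtemp(prefix=prefix, dir=warehouse_dir)
        self._staging_dirs.append(path)
        return path

    def cleanup(self):
        for path in self._staging_dirs:
            shutil.rmtree(path, ignore_errors=True)
        self._staging_dirs.clear()

warehouse = tempfile.mkdtemp(prefix="warehouse_")
mgr = StagingDirManager()
staging = mgr.create_staging_dir(warehouse)
assert os.path.isdir(staging)
mgr.cleanup()  # what the shutdown hook would do at session end
assert not os.path.exists(staging)
```

Spark 2.x takes the analogous route of creating the staging directory itself and arranging for its deletion when the job finishes (e.g., via deleteOnExit on the filesystem), which is the behavior the backported patch mirrors.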
[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mingjie tang updated SPARK-18372: - Description: Steps to reproduce: 1. Launch spark-shell 2. Run the following scala code via Spark-Shell scala> val hivesampletabledf = sqlContext.table("hivesampletable") scala> import org.apache.spark.sql.DataFrameWriter scala> val dfw : DataFrameWriter = hivesampletabledf.write scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") scala> dfw.insertInto("hivesampletablecopypy") scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) hivesampletablecopypydfdf.show 3. in HDFS (in our case, WASB), we can see the following folders hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 the issue is that these don't get cleaned up and get accumulated = with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any difference. 
.hive-staging folders are created under the folder - hive/warehouse/hivesampletablecopypy/ we have tried adding this property to hive-site.xml and restart the components - hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging a new .hive-staging folder was created in hive/warehouse/ folder moreover, please understand that if we run the hive query in pure Hive via Hive CLI on the same Spark cluster, we don't see the behavior so it doesn't appear to be a Hive issue/behavior in this case- this is a spark behavior I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark configuration already The issue happens via Spark-submit as well - customer used the following command to reproduce this - spark-submit test-hive-staging-cleanup.py
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649386#comment-15649386 ]

mingjie tang commented on SPARK-18372:
--

the PR is https://github.com/apache/spark/pull/15819

> .Hive-staging folders created from Spark hiveContext are not getting cleaned up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn
> Reporter: mingjie tang
> Fix For: 2.0.1
>
> Steps to reproduce:
>
> 1. Launch spark-shell
> 2. Run the following Scala code via spark-shell:
> scala> val hivesampletabledf = sqlContext.table("hivesampletable")
> scala> import org.apache.spark.sql.DataFrameWriter
> scala> val dfw : DataFrameWriter = hivesampletabledf.write
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
> scala> dfw.insertInto("hivesampletablecopypy")
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these do not get cleaned up and keep accumulating.
> =
> With the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml; it didn't make any difference.
> .hive-staging folders are created under the folder hive/warehouse/hivesampletablecopypy/. We have tried adding this property to hive-site.xml and restarting the components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> After that, a new .hive-staging folder was still created in the hive/warehouse/ folder.
> Moreover, please understand that if we run the Hive query in pure Hive via the Hive CLI on the same Spark cluster, we do not see this behavior, so it does not appear to be a Hive issue; this is Spark behavior.
> I checked in Ambari: spark.yarn.preserve.staging.files=false is already set in the Spark configuration.
> The issue happens via spark-submit as well; the customer used the following command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
> Solution:
> This bug was reported by customers. The reason is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, this staging directory is removed after the Hive session expires. However, Spark fails to notify Hive to remove the staging files.
> Thus, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable to create the .hive-staging directory; then, after the Spark session ends, this .hive-staging directory is removed.
> This update has been tested on Spark 1.5.2 and Spark 1.6.3, and the pull request is:
> For the test, I have manually checked for leftover .staging files in the table's directory after the spark-shell closes. Meanwhile, please advise how to write a test case, because the directory for the related tables cannot be obtained.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
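The cleanup approach proposed in the comment above can be sketched as follows. This is a hypothetical illustration, not the actual Spark patch: the object and method names (`StagingDirCleanup`, `createStagingDir`) are assumptions, and it uses the local filesystem via `java.io.File`, whereas the real change lives inside `org.apache.spark.sql.hive.InsertIntoHiveTable` and works against the Hadoop `FileSystem` API. The idea is the same: Spark creates the `.hive-staging` directory itself and registers it for deletion when the Spark session's JVM exits, instead of relying on Hive to clean it up.

```scala
import java.io.File

// Hypothetical sketch of the proposed fix: create the staging directory
// ourselves and register a JVM shutdown hook that removes it, so the
// directory does not linger under the table's warehouse folder after the
// spark-shell (or spark-submit) process exits.
object StagingDirCleanup {
  def createStagingDir(tableDir: File): File = {
    // Mirror Hive's naming scheme: a hidden .hive-staging_* directory
    // under the table directory.
    val staging = new File(tableDir, ".hive-staging_" + System.nanoTime())
    staging.mkdirs()
    // Recursively delete the staging directory when the JVM shuts down.
    // (Spark 2.x uses ShutdownHookManager / FileSystem.deleteOnExit for
    // the same purpose.)
    sys.addShutdownHook {
      def delete(f: File): Unit = {
        Option(f.listFiles()).foreach(_.foreach(delete))
        f.delete()
      }
      delete(staging)
    }
    staging
  }
}
```

The key design point is that the deletion is tied to the lifetime of the Spark process rather than to a Hive session, which is what the description identifies as the gap: Spark never tells Hive the session is over, so Hive's own expiry-based cleanup never fires.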
[jira] [Created] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
mingjie tang created SPARK-18372:

Summary: .Hive-staging folders created from Spark hiveContext are not getting cleaned up
Key: SPARK-18372
URL: https://issues.apache.org/jira/browse/SPARK-18372
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.2, 1.5.2, 1.6.3
Environment: spark standalone and spark yarn
Reporter: mingjie tang
Fix For: 2.0.1

Steps to reproduce:

1. Launch spark-shell
2. Run the following Scala code via spark-shell:

scala> val hivesampletabledf = sqlContext.table("hivesampletable")
scala> import org.apache.spark.sql.DataFrameWriter
scala> val dfw : DataFrameWriter = hivesampletabledf.write
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
scala> dfw.insertInto("hivesampletablecopypy")
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
scala> hivesampletablecopypydfdf.show

3. In HDFS (in our case, WASB), we can see the following folders:

hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693

The issue is that these do not get cleaned up and keep accumulating.

=

With the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml; it didn't make any difference.
.hive-staging folders are created under the folder hive/warehouse/hivesampletablecopypy/. We have tried adding this property to hive-site.xml and restarting the components:

<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>

After that, a new .hive-staging folder was still created in the hive/warehouse/ folder.

Moreover, please understand that if we run the Hive query in pure Hive via the Hive CLI on the same Spark cluster, we do not see this behavior, so it does not appear to be a Hive issue; this is Spark behavior.

I checked in Ambari: spark.yarn.preserve.staging.files=false is already set in the Spark configuration.

The issue happens via spark-submit as well; the customer used the following command to reproduce it:

spark-submit test-hive-staging-cleanup.py

Solution:

This bug was reported by customers. The reason is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, this staging directory is removed after the Hive session expires. However, Spark fails to notify Hive to remove the staging files. Thus, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable to create the .hive-staging directory; then, after the Spark session ends, this .hive-staging directory is removed.

This update has been tested on Spark 1.5.2 and Spark 1.6.3, and the pull request is:

For the test, I have manually checked for leftover .staging files in the table's directory after the spark-shell closes. Meanwhile, please advise how to write a test case, because the directory for the related tables cannot be obtained.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
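The manual verification described in the report (checking whether any .hive-staging folders are left behind after the spark-shell session closes) can be sketched like this. This is a local-filesystem illustration with assumed names (`StagingDirCheck`, `leftoverStagingDirs`); on HDFS or WASB the equivalent check would use the Hadoop `FileSystem` API or `hdfs dfs -ls` against the table's warehouse path.

```scala
import java.io.File

// List any leftover hidden .hive-staging_* directories under a table's
// warehouse directory. An empty result means the cleanup worked; data
// files and other entries are ignored.
object StagingDirCheck {
  def leftoverStagingDirs(tableDir: File): Seq[File] =
    Option(tableDir.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isDirectory && f.getName.startsWith(".hive-staging"))
      .toSeq
}
```

A check like this could also be the basis for the automated test the reporter asks about: run an insertInto, stop the session, then assert that `leftoverStagingDirs` on the table directory is empty.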